Time-Series Forecasting for the Walmart (M5) Dataset
Data Section
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import plotly.express as px
import plotly.graph_objects as go
from statsmodels.tsa.seasonal import seasonal_decompose
%matplotlib inline
from statsmodels.tsa.arima.model import ARIMA  # statsmodels.tsa.arima_model (ARMA/old ARIMA) was removed in statsmodels 0.13
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf # for determining (p,q) orders
from pmdarima import auto_arima # for determining ARIMA orders
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from sklearn.metrics import r2_score,mean_squared_error
import xgboost as xgb
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")
# Display floats with two decimal places
# pd.set_option('display.float_format', '{:.2f}'.format)
# !pip install pmdarima
# !pip install xgboost
calendar_df = pd.read_csv('C:/Users/97610/Downloads/m5-forecasting-accuracy/calendar.csv')
calendar_df.info()  # 1969 rows x 14 columns
calendar_df.head()
| | date | wm_yr_wk | weekday | wday | month | year | d | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA | snap_TX | snap_WI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-29 | 11101 | Saturday | 1 | 1 | 2011 | d_1 | NaN | NaN | NaN | NaN | 0 | 0 | 0 |
| 1 | 2011-01-30 | 11101 | Sunday | 2 | 1 | 2011 | d_2 | NaN | NaN | NaN | NaN | 0 | 0 | 0 |
| 2 | 2011-01-31 | 11101 | Monday | 3 | 1 | 2011 | d_3 | NaN | NaN | NaN | NaN | 0 | 0 | 0 |
| 3 | 2011-02-01 | 11101 | Tuesday | 4 | 2 | 2011 | d_4 | NaN | NaN | NaN | NaN | 1 | 1 | 0 |
| 4 | 2011-02-02 | 11101 | Wednesday | 5 | 2 | 2011 | d_5 | NaN | NaN | NaN | NaN | 1 | 0 | 1 |
calendar_df.tail()
| | date | wm_yr_wk | weekday | wday | month | year | d | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA | snap_TX | snap_WI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1964 | 2016-06-15 | 11620 | Wednesday | 5 | 6 | 2016 | d_1965 | NaN | NaN | NaN | NaN | 0 | 1 | 1 |
| 1965 | 2016-06-16 | 11620 | Thursday | 6 | 6 | 2016 | d_1966 | NaN | NaN | NaN | NaN | 0 | 0 | 0 |
| 1966 | 2016-06-17 | 11620 | Friday | 7 | 6 | 2016 | d_1967 | NaN | NaN | NaN | NaN | 0 | 0 | 0 |
| 1967 | 2016-06-18 | 11621 | Saturday | 1 | 6 | 2016 | d_1968 | NaN | NaN | NaN | NaN | 0 | 0 | 0 |
| 1968 | 2016-06-19 | 11621 | Sunday | 2 | 6 | 2016 | d_1969 | NBAFinalsEnd | Sporting | Father's day | Cultural | 0 | 0 | 0 |
calendar_df["date"] = pd.to_datetime(calendar_df["date"])
total_days = len(calendar_df)
event_days = calendar_df[['event_name_1', 'event_name_2']].dropna(how='all').shape[0]
event_percentage = (event_days / total_days) * 100
no_event_percentage = 100 - event_percentage
labels = ['Days with Events', 'Days without Events']
sizes = [event_percentage, no_event_percentage]
plt.figure(figsize=(6, 6))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, wedgeprops={'edgecolor': 'black'})
plt.title('Percentage of Days with Events')
plt.show()
event_counts = pd.concat([calendar_df['event_name_1'], calendar_df['event_name_2']]).value_counts()
print("Event Name Counts:\n", event_counts)
Event Name Counts:
SuperBowl              6
Cinco De Mayo          6
Ramadan starts         6
ValentinesDay          6
Father's day           6
NBAFinalsEnd           6
NBAFinalsStart         6
MemorialDay            6
Mother's day           6
Easter                 6
Pesach End             6
Purim End              6
StPatricksDay          6
LentWeek2              6
LentStart              6
PresidentsDay          6
OrthodoxEaster         6
Thanksgiving           5
MartinLutherKingDay    5
OrthodoxChristmas      5
NewYear                5
Chanukah End           5
Christmas              5
Halloween              5
VeteransDay            5
EidAlAdha              5
ColumbusDay            5
LaborDay               5
Eid al-Fitr            5
IndependenceDay        5
Name: count, dtype: int64
calendar_df[['year', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']].groupby("year").count()
| year | event_name_1 | event_type_1 | event_name_2 | event_type_2 |
|---|---|---|---|---|
| 2011 | 26 | 26 | 1 | 1 |
| 2012 | 30 | 30 | 0 | 0 |
| 2013 | 29 | 29 | 1 | 1 |
| 2014 | 28 | 28 | 2 | 2 |
| 2015 | 30 | 30 | 0 | 0 |
| 2016 | 19 | 19 | 1 | 1 |
event_type_counts = pd.concat([calendar_df['event_type_1'], calendar_df['event_type_2']]).value_counts()
print("Event Type Counts:\n", event_type_counts)
Event Type Counts:
Religious    56
National     52
Cultural     41
Sporting     18
Name: count, dtype: int64
unique_event_types = event_type_counts.index
colors = sns.color_palette("Paired", len(unique_event_types))
plt.figure(figsize=(10, 6))
event_type_counts.plot(kind='bar', color=colors, edgecolor='black')
plt.xlabel("Event Type")
plt.ylabel("Count")
plt.title("Distribution of Each Event Category")
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
calendar_df['snap_CA'] = calendar_df['snap_CA'].astype(bool)
calendar_df['snap_TX'] = calendar_df['snap_TX'].astype(bool)
calendar_df['snap_WI'] = calendar_df['snap_WI'].astype(bool)
snap_melted = calendar_df.melt(value_vars=['snap_CA', 'snap_TX', 'snap_WI'],
var_name='State',
value_name='SNAP_Purchase')
snap_counts = snap_melted.groupby(['State', 'SNAP_Purchase']).size().reset_index(name='Count')
snap_counts['Percentage'] = snap_counts.groupby('State')['Count'].transform(lambda x: x / x.sum() * 100)
g = sns.catplot(data=snap_counts, x='SNAP_Purchase', y='Percentage', col='State', kind='bar',
                palette="Paired", col_order=['snap_CA', 'snap_TX', 'snap_WI'], height=3, aspect=1)
g.set_axis_labels("", "%")
g.set_titles(col_template="{col_name}")
plt.suptitle("Days with SNAP purchases for all states", fontsize=12)
plt.show()
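The `transform` call above broadcasts each state's total count back onto that state's own rows, which is what turns raw counts into within-state percentages. A minimal sketch with made-up counts (the numbers below are illustrative, not the dataset's):

```python
import pandas as pd

counts = pd.DataFrame({
    "State": ["snap_CA", "snap_CA", "snap_TX", "snap_TX"],
    "SNAP_Purchase": [False, True, False, True],
    "Count": [1319, 650, 1319, 650],  # hypothetical day counts
})
# transform returns a result aligned to the original rows,
# so each row is divided by its own group's total
counts["Percentage"] = counts.groupby("State")["Count"].transform(lambda x: x / x.sum() * 100)
print(counts)  # each state's two percentages sum to 100
```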
sales_df = pd.read_csv('C:/Users/97610/Downloads/m5-forecasting-accuracy/sales_train_evaluation.csv')
sales_df.info()  # 30490 rows x 1947 columns
sales_df.head()
| | id | item_id | dept_id | cat_id | store_id | state_id | d_1 | d_2 | d_3 | d_4 | ... | d_1932 | d_1933 | d_1934 | d_1935 | d_1936 | d_1937 | d_1938 | d_1939 | d_1940 | d_1941 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HOBBIES_1_001_CA_1_evaluation | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 | ... | 2 | 4 | 0 | 0 | 0 | 0 | 3 | 3 | 0 | 1 |
| 1 | HOBBIES_1_002_CA_1_evaluation | HOBBIES_1_002 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 | ... | 0 | 1 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | HOBBIES_1_003_CA_1_evaluation | HOBBIES_1_003 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 | ... | 1 | 0 | 2 | 0 | 0 | 0 | 2 | 3 | 0 | 1 |
| 3 | HOBBIES_1_004_CA_1_evaluation | HOBBIES_1_004 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 4 | 0 | 1 | 3 | 0 | 2 | 6 |
| 4 | HOBBIES_1_005_CA_1_evaluation | HOBBIES_1_005 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 2 | 1 | 0 |
5 rows × 1947 columns
sales_df.tail()
| | id | item_id | dept_id | cat_id | store_id | state_id | d_1 | d_2 | d_3 | d_4 | ... | d_1932 | d_1933 | d_1934 | d_1935 | d_1936 | d_1937 | d_1938 | d_1939 | d_1940 | d_1941 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30485 | FOODS_3_823_WI_3_evaluation | FOODS_3_823 | FOODS_3 | FOODS | WI_3 | WI | 0 | 0 | 2 | 2 | ... | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| 30486 | FOODS_3_824_WI_3_evaluation | FOODS_3_824 | FOODS_3 | FOODS | WI_3 | WI | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 30487 | FOODS_3_825_WI_3_evaluation | FOODS_3_825 | FOODS_3 | FOODS | WI_3 | WI | 0 | 6 | 0 | 2 | ... | 0 | 0 | 1 | 2 | 0 | 1 | 0 | 1 | 0 | 2 |
| 30488 | FOODS_3_826_WI_3_evaluation | FOODS_3_826 | FOODS_3 | FOODS | WI_3 | WI | 0 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 4 | 6 | 0 | 1 | 1 | 1 | 0 |
| 30489 | FOODS_3_827_WI_3_evaluation | FOODS_3_827 | FOODS_3 | FOODS | WI_3 | WI | 0 | 0 | 0 | 0 | ... | 1 | 2 | 0 | 5 | 4 | 0 | 2 | 2 | 5 | 1 |
5 rows × 1947 columns
cols = sales_df.dtypes.index.tolist()
types = sales_df.dtypes.values.tolist()
sales_df.dtypes.index, sales_df.dtypes.values
(Index(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'd_1',
'd_2', 'd_3', 'd_4',
...
'd_1932', 'd_1933', 'd_1934', 'd_1935', 'd_1936', 'd_1937', 'd_1938',
'd_1939', 'd_1940', 'd_1941'],
dtype='object', length=1947),
array([dtype('O'), dtype('O'), dtype('O'), ..., dtype('int64'),
dtype('int64'), dtype('int64')], dtype=object))
def downcast(df):
cols = df.dtypes.index.tolist()
types = df.dtypes.values.tolist()
for i, t in enumerate(types):
if 'int' in str(t):
if df[cols[i]].min() > np.iinfo(np.int8).min and df[cols[i]].max() < np.iinfo(np.int8).max:
df[cols[i]] = df[cols[i]].astype(np.int8)
elif df[cols[i]].min() > np.iinfo(np.int16).min and df[cols[i]].max() < np.iinfo(np.int16).max:
df[cols[i]] = df[cols[i]].astype(np.int16)
elif df[cols[i]].min() > np.iinfo(np.int32).min and df[cols[i]].max() < np.iinfo(np.int32).max:
df[cols[i]] = df[cols[i]].astype(np.int32)
else:
df[cols[i]] = df[cols[i]].astype(np.int64)
elif 'float' in str(t):
if df[cols[i]].min() > np.finfo(np.float16).min and df[cols[i]].max() < np.finfo(np.float16).max:
df[cols[i]] = df[cols[i]].astype(np.float16)
elif df[cols[i]].min() > np.finfo(np.float32).min and df[cols[i]].max() < np.finfo(np.float32).max:
df[cols[i]] = df[cols[i]].astype(np.float32)
else:
df[cols[i]] = df[cols[i]].astype(np.float64)
elif t == object:
if cols[i] == 'date':
df[cols[i]] = pd.to_datetime(df[cols[i]], format='%Y-%m-%d')
else:
df[cols[i]] = df[cols[i]].astype('category')
return df
sales_df = downcast(sales_df)
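The hand-rolled `downcast` above mirrors what pandas offers built-in for single numeric columns; a quick sketch of the same idea with `pd.to_numeric` (the values here are arbitrary):

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 3, 12, 250], dtype=np.int64)
small = pd.to_numeric(s, downcast="integer")  # narrowest signed integer type that fits
print(small.dtype)  # int16, since 250 overflows int8's max of 127
print(s.nbytes, "->", small.nbytes)  # 32 -> 8 bytes of data
```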
prices = pd.read_csv('C:/Users/97610/Downloads/m5-forecasting-accuracy/sell_prices.csv')
prices.head()
| | store_id | item_id | wm_yr_wk | sell_price |
|---|---|---|---|---|
| 0 | CA_1 | HOBBIES_1_001 | 11325 | 9.58 |
| 1 | CA_1 | HOBBIES_1_001 | 11326 | 9.58 |
| 2 | CA_1 | HOBBIES_1_001 | 11327 | 8.26 |
| 3 | CA_1 | HOBBIES_1_001 | 11328 | 8.26 |
| 4 | CA_1 | HOBBIES_1_001 | 11329 | 8.26 |
prices.tail()
| | store_id | item_id | wm_yr_wk | sell_price |
|---|---|---|---|---|
| 6841116 | WI_3 | FOODS_3_827 | 11617 | 1.0 |
| 6841117 | WI_3 | FOODS_3_827 | 11618 | 1.0 |
| 6841118 | WI_3 | FOODS_3_827 | 11619 | 1.0 |
| 6841119 | WI_3 | FOODS_3_827 | 11620 | 1.0 |
| 6841120 | WI_3 | FOODS_3_827 | 11621 | 1.0 |
prices = downcast(prices)
df = pd.melt(sales_df, id_vars=['id','item_id','dept_id','cat_id','store_id','state_id'],
var_name='d',value_name='unit_sold').dropna()
df.head()
| | id | item_id | dept_id | cat_id | store_id | state_id | d | unit_sold |
|---|---|---|---|---|---|---|---|---|
| 0 | HOBBIES_1_001_CA_1_evaluation | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | d_1 | 0 |
| 1 | HOBBIES_1_002_CA_1_evaluation | HOBBIES_1_002 | HOBBIES_1 | HOBBIES | CA_1 | CA | d_1 | 0 |
| 2 | HOBBIES_1_003_CA_1_evaluation | HOBBIES_1_003 | HOBBIES_1 | HOBBIES | CA_1 | CA | d_1 | 0 |
| 3 | HOBBIES_1_004_CA_1_evaluation | HOBBIES_1_004 | HOBBIES_1 | HOBBIES | CA_1 | CA | d_1 | 0 |
| 4 | HOBBIES_1_005_CA_1_evaluation | HOBBIES_1_005 | HOBBIES_1 | HOBBIES | CA_1 | CA | d_1 | 0 |
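The melt above turns the wide layout (one `d_*` column per day) into one row per item-day. On a toy two-item, two-day frame (the ids are hypothetical):

```python
import pandas as pd

wide = pd.DataFrame({
    "id":  ["A", "B"],
    "d_1": [0, 2],
    "d_2": [1, 0],
})
# each d_* column becomes rows; id_vars are repeated alongside them
long = pd.melt(wide, id_vars=["id"], var_name="d", value_name="unit_sold")
print(long)
#   id    d  unit_sold
# 0  A  d_1          0
# 1  B  d_1          2
# 2  A  d_2          1
# 3  B  d_2          0
```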
df = pd.merge(df, calendar_df, on='d', how='left')
df.head()
| | id | item_id | dept_id | cat_id | store_id | state_id | d | unit_sold | date | wm_yr_wk | ... | wday | month | year | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA | snap_TX | snap_WI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HOBBIES_1_001_CA_1_evaluation | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | d_1 | 0 | 2011-01-29 | 11101 | ... | 1 | 1 | 2011 | NaN | NaN | NaN | NaN | False | False | False |
| 1 | HOBBIES_1_002_CA_1_evaluation | HOBBIES_1_002 | HOBBIES_1 | HOBBIES | CA_1 | CA | d_1 | 0 | 2011-01-29 | 11101 | ... | 1 | 1 | 2011 | NaN | NaN | NaN | NaN | False | False | False |
| 2 | HOBBIES_1_003_CA_1_evaluation | HOBBIES_1_003 | HOBBIES_1 | HOBBIES | CA_1 | CA | d_1 | 0 | 2011-01-29 | 11101 | ... | 1 | 1 | 2011 | NaN | NaN | NaN | NaN | False | False | False |
| 3 | HOBBIES_1_004_CA_1_evaluation | HOBBIES_1_004 | HOBBIES_1 | HOBBIES | CA_1 | CA | d_1 | 0 | 2011-01-29 | 11101 | ... | 1 | 1 | 2011 | NaN | NaN | NaN | NaN | False | False | False |
| 4 | HOBBIES_1_005_CA_1_evaluation | HOBBIES_1_005 | HOBBIES_1 | HOBBIES | CA_1 | CA | d_1 | 0 | 2011-01-29 | 11101 | ... | 1 | 1 | 2011 | NaN | NaN | NaN | NaN | False | False | False |
5 rows × 21 columns
df = pd.merge(df, prices, on=['store_id','item_id','wm_yr_wk'], how='left')
df.head()
| | id | item_id | dept_id | cat_id | store_id | state_id | d | unit_sold | date | wm_yr_wk | ... | month | year | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA | snap_TX | snap_WI | sell_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HOBBIES_1_001_CA_1_evaluation | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | d_1 | 0 | 2011-01-29 | 11101 | ... | 1 | 2011 | NaN | NaN | NaN | NaN | False | False | False | NaN |
| 1 | HOBBIES_1_002_CA_1_evaluation | HOBBIES_1_002 | HOBBIES_1 | HOBBIES | CA_1 | CA | d_1 | 0 | 2011-01-29 | 11101 | ... | 1 | 2011 | NaN | NaN | NaN | NaN | False | False | False | NaN |
| 2 | HOBBIES_1_003_CA_1_evaluation | HOBBIES_1_003 | HOBBIES_1 | HOBBIES | CA_1 | CA | d_1 | 0 | 2011-01-29 | 11101 | ... | 1 | 2011 | NaN | NaN | NaN | NaN | False | False | False | NaN |
| 3 | HOBBIES_1_004_CA_1_evaluation | HOBBIES_1_004 | HOBBIES_1 | HOBBIES | CA_1 | CA | d_1 | 0 | 2011-01-29 | 11101 | ... | 1 | 2011 | NaN | NaN | NaN | NaN | False | False | False | NaN |
| 4 | HOBBIES_1_005_CA_1_evaluation | HOBBIES_1_005 | HOBBIES_1 | HOBBIES | CA_1 | CA | d_1 | 0 | 2011-01-29 | 11101 | ... | 1 | 2011 | NaN | NaN | NaN | NaN | False | False | False | NaN |
5 rows × 22 columns
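The `NaN` values in `sell_price` come from item-weeks that have no row in `sell_prices.csv` (the item was not on sale yet). A left merge keeps those rows, and `indicator=True` makes them easy to count (toy frames, hypothetical values):

```python
import pandas as pd

sales = pd.DataFrame({"item_id": ["X", "X", "Y"], "wm_yr_wk": [11101, 11102, 11101]})
prices = pd.DataFrame({"item_id": ["X"], "wm_yr_wk": [11102], "sell_price": [9.58]})

merged = pd.merge(sales, prices, on=["item_id", "wm_yr_wk"], how="left", indicator=True)
# 'left_only' rows had no matching price and carry NaN in sell_price
print(merged["_merge"].value_counts())
```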
df_grouped = df.groupby("dept_id")["unit_sold"].sum().reset_index()
print(df_grouped)
       dept_id  unit_sold
0      FOODS_1    5190400
1      FOODS_2    7795025
2      FOODS_3   32937002
3    HOBBIES_1    5699014
4    HOBBIES_2     541642
5  HOUSEHOLD_1   11722853
6  HOUSEHOLD_2    3041237
plt.figure(figsize=(12, 6))
plt.bar(df_grouped["dept_id"], df_grouped["unit_sold"], color='skyblue', edgecolor='black', alpha=0.7)
plt.xlabel("Department ID")
plt.ylabel("Total Units Sold")
plt.title("Total Units Sold per Department")
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
df["sell_price"] = df["sell_price"].fillna(0)
df["revenue"] = df["unit_sold"] * df["sell_price"]
df_grouped_revenue = df.groupby("dept_id")["revenue"].sum().reset_index()
plt.figure(figsize=(12, 6))
plt.bar(df_grouped_revenue["dept_id"], df_grouped_revenue["revenue"], color='lightcoral', edgecolor='black', alpha=0.7)
plt.xlabel("Department ID")
plt.ylabel("Total Revenue")
plt.title("Total Revenue per Department")
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
Foods3 = df[df['dept_id'] == 'FOODS_3'].copy()
Foods3.head()
| | id | item_id | dept_id | cat_id | store_id | state_id | d | unit_sold | date | wm_yr_wk | ... | year | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA | snap_TX | snap_WI | sell_price | revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2226 | FOODS_3_001_CA_1_evaluation | FOODS_3_001 | FOODS_3 | FOODS | CA_1 | CA | d_1 | 1 | 2011-01-29 | 11101 | ... | 2011 | NaN | NaN | NaN | NaN | False | False | False | 2.279297 | 2.279297 |
| 2227 | FOODS_3_002_CA_1_evaluation | FOODS_3_002 | FOODS_3 | FOODS | CA_1 | CA | d_1 | 0 | 2011-01-29 | 11101 | ... | 2011 | NaN | NaN | NaN | NaN | False | False | False | 0.000000 | 0.000000 |
| 2228 | FOODS_3_003_CA_1_evaluation | FOODS_3_003 | FOODS_3 | FOODS | CA_1 | CA | d_1 | 0 | 2011-01-29 | 11101 | ... | 2011 | NaN | NaN | NaN | NaN | False | False | False | 0.000000 | 0.000000 |
| 2229 | FOODS_3_004_CA_1_evaluation | FOODS_3_004 | FOODS_3 | FOODS | CA_1 | CA | d_1 | 0 | 2011-01-29 | 11101 | ... | 2011 | NaN | NaN | NaN | NaN | False | False | False | 0.000000 | 0.000000 |
| 2230 | FOODS_3_005_CA_1_evaluation | FOODS_3_005 | FOODS_3 | FOODS | CA_1 | CA | d_1 | 1 | 2011-01-29 | 11101 | ... | 2011 | NaN | NaN | NaN | NaN | False | False | False | 1.980469 | 1.980469 |
5 rows × 23 columns
sns.boxplot(data=Foods3, x='state_id', y='unit_sold')
plt.xlabel('State ID')
plt.ylabel('Units Sold')
plt.title('Boxplot of Foods3 Units Sold by State')
plt.show()
for col in ['state_id', 'store_id', 'dept_id']:
Foods3[col] = Foods3[col].astype('category')
group = Foods3.groupby(['year', 'date', 'state_id', 'store_id'], as_index=False)['unit_sold'].sum()
fig = px.violin(group, x='store_id', color='state_id', y='unit_sold', box=True)
fig.update_xaxes(title_text='Store')
fig.update_yaxes(title_text='Total Foods3 items sold')
fig.update_layout(template='seaborn', title='Distribution of Foods3 Sold by stores', legend_title_text='State')
fig.show()
total_time_series = Foods3.groupby('date', as_index=False)['unit_sold'].sum()
# Create a line plot for total Foods3 sales over time
fig = px.line(total_time_series, x='date', y='unit_sold',
title='Time Series of Total Foods3 Sales',
labels={'unit_sold': 'Total Units Sold', 'date': 'Date'})
fig.show()
# Aggregate unit_sold by state
state_level_time_series = Foods3.groupby(['date', 'state_id'], as_index=False)['unit_sold'].sum()
fig = px.line(state_level_time_series, x='date', y='unit_sold', color='state_id', facet_col='state_id',
title='Time Series of Foods3 Sales at State Level',
labels={'unit_sold': 'Total Units Sold', 'date': 'Date', 'state_id': 'State'},
height=600,
facet_col_wrap=1
)
fig.update_layout(
showlegend=False,
xaxis_title="Date",
yaxis_title="Total Units Sold",
margin=dict(l=40, r=40, t=40, b=40)
)
fig.show()
# Aggregate unit_sold by store
time_series_data = Foods3.groupby(['date', 'store_id', 'state_id'], as_index=False)['unit_sold'].sum()
fig = px.line(time_series_data, x='date', y='unit_sold', color='store_id', facet_row='state_id',
title='Time Series of Foods3 Sales per Store',
labels={'unit_sold': 'Total Units Sold', 'date': 'Date', 'store_id': 'Store', 'state_id': 'State'},
height=900)
fig.show()
Foods3_CA = Foods3[Foods3['state_id'] == 'CA'].copy()
Foods3_CA.head()
| | id | item_id | dept_id | cat_id | store_id | state_id | d | unit_sold | date | wm_yr_wk | ... | year | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA | snap_TX | snap_WI | sell_price | revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2226 | FOODS_3_001_CA_1_evaluation | FOODS_3_001 | FOODS_3 | FOODS | CA_1 | CA | d_1 | 1 | 2011-01-29 | 11101 | ... | 2011 | NaN | NaN | NaN | NaN | False | False | False | 2.279297 | 2.279297 |
| 2227 | FOODS_3_002_CA_1_evaluation | FOODS_3_002 | FOODS_3 | FOODS | CA_1 | CA | d_1 | 0 | 2011-01-29 | 11101 | ... | 2011 | NaN | NaN | NaN | NaN | False | False | False | 0.000000 | 0.000000 |
| 2228 | FOODS_3_003_CA_1_evaluation | FOODS_3_003 | FOODS_3 | FOODS | CA_1 | CA | d_1 | 0 | 2011-01-29 | 11101 | ... | 2011 | NaN | NaN | NaN | NaN | False | False | False | 0.000000 | 0.000000 |
| 2229 | FOODS_3_004_CA_1_evaluation | FOODS_3_004 | FOODS_3 | FOODS | CA_1 | CA | d_1 | 0 | 2011-01-29 | 11101 | ... | 2011 | NaN | NaN | NaN | NaN | False | False | False | 0.000000 | 0.000000 |
| 2230 | FOODS_3_005_CA_1_evaluation | FOODS_3_005 | FOODS_3 | FOODS | CA_1 | CA | d_1 | 1 | 2011-01-29 | 11101 | ... | 2011 | NaN | NaN | NaN | NaN | False | False | False | 1.980469 | 1.980469 |
5 rows × 23 columns
columns_to_drop = ["id", "item_id","state_id", "dept_id", "cat_id", "wm_yr_wk", "year","weekday","wday","month","snap_TX","snap_WI"]
Foods3_CA = Foods3_CA.drop(columns=columns_to_drop)
Foods3_CA.head()
| | store_id | d | unit_sold | date | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA | sell_price | revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2226 | CA_1 | d_1 | 1 | 2011-01-29 | NaN | NaN | NaN | NaN | False | 2.279297 | 2.279297 |
| 2227 | CA_1 | d_1 | 0 | 2011-01-29 | NaN | NaN | NaN | NaN | False | 0.000000 | 0.000000 |
| 2228 | CA_1 | d_1 | 0 | 2011-01-29 | NaN | NaN | NaN | NaN | False | 0.000000 | 0.000000 |
| 2229 | CA_1 | d_1 | 0 | 2011-01-29 | NaN | NaN | NaN | NaN | False | 0.000000 | 0.000000 |
| 2230 | CA_1 | d_1 | 1 | 2011-01-29 | NaN | NaN | NaN | NaN | False | 1.980469 | 1.980469 |
Foods3_CA["snap_CA"] = Foods3_CA["snap_CA"].astype(int)
Foods3_CA.head()
| | store_id | d | unit_sold | date | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA | sell_price | revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2226 | CA_1 | d_1 | 1 | 2011-01-29 | NaN | NaN | NaN | NaN | 0 | 2.279297 | 2.279297 |
| 2227 | CA_1 | d_1 | 0 | 2011-01-29 | NaN | NaN | NaN | NaN | 0 | 0.000000 | 0.000000 |
| 2228 | CA_1 | d_1 | 0 | 2011-01-29 | NaN | NaN | NaN | NaN | 0 | 0.000000 | 0.000000 |
| 2229 | CA_1 | d_1 | 0 | 2011-01-29 | NaN | NaN | NaN | NaN | 0 | 0.000000 | 0.000000 |
| 2230 | CA_1 | d_1 | 1 | 2011-01-29 | NaN | NaN | NaN | NaN | 0 | 1.980469 | 1.980469 |
Foods3_CA_agg = Foods3_CA.groupby(["d", "store_id"], as_index=False).agg(
    {
        # "d" and "store_id" are the grouping keys; with as_index=False they come
        # back as columns automatically and must not appear in the agg dict
        "date": "first",
        "unit_sold": "sum",    # total units sold
        "sell_price": "mean",  # average sell price
        "revenue": "sum",      # total revenue
        "event_name_1": "first",
        "event_type_1": "first",
        "event_name_2": "first",
        "event_type_2": "first",
        "snap_CA": "max",
    }
)
Foods3_CA_agg = Foods3_CA_agg.dropna(subset=["d", "store_id"])
Foods3_CA_agg.head()
| | date | d | store_id | unit_sold | sell_price | revenue | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-29 | d_1 | CA_1 | 2268 | 1.161917 | 4323.710449 | None | None | None | None | 0.0 |
| 1 | 2011-01-29 | d_1 | CA_2 | 1575 | 1.027044 | 3093.028076 | None | None | None | None | 0.0 |
| 2 | 2011-01-29 | d_1 | CA_3 | 2478 | 1.136871 | 4282.417969 | None | None | None | None | 0.0 |
| 3 | 2011-01-29 | d_1 | CA_4 | 759 | 1.047593 | 1566.507812 | None | None | None | None | 0.0 |
| 10 | 2011-02-07 | d_10 | CA_1 | 1711 | 1.220468 | 3310.140137 | None | None | None | None | 1.0 |
Foods3_CA_agg.info()  # 7764 rows x 11 columns
# Filter rows where event_name_2 is not None and include event_name_1 on the same date
event1_event2_dates = Foods3_CA_agg.dropna(subset=["event_name_2"])[["date", "event_name_1", "event_name_2"]]
unique_event1_event2_dates = event1_event2_dates.drop_duplicates()
print("Unique Dates with Event Name 1 and Event Name 2:")
print(unique_event1_event2_dates)
Unique Dates with Event Name 1 and Event Name 2:
date event_name_1 event_name_2
1990 2014-04-20 Easter OrthodoxEaster
2620 2014-06-15 NBAFinalsEnd Father's day
17510 2013-05-05 OrthodoxEaster Cinco De Mayo
17860 2011-04-24 OrthodoxEaster Easter
event_dummies1 = pd.get_dummies(Foods3_CA_agg["event_name_1"], prefix="event1")
Foods3_CA_corr = pd.concat([Foods3_CA_agg, event_dummies1], axis=1)
numeric_cols = ["unit_sold"] + list(event_dummies1.columns) # Only unit_sold and event dummies
correlation_matrix = Foods3_CA_corr[numeric_cols].corr()
event_correlations = correlation_matrix["unit_sold"].drop("unit_sold").sort_values(ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x=event_correlations.values, y=event_correlations.index,
            hue=event_correlations.index, palette="coolwarm", legend=False)
plt.xlabel("Correlation with Unit Sold")
plt.ylabel("Event Name1")
plt.title("Correlation Between Events and Units Sold")
plt.show()
top_5_absolute_correlated_events_unit = event_correlations.abs().sort_values(ascending=False).head(5).reset_index()
top_5_absolute_correlated_events_unit.columns = ["event_name_1", "absolute_correlation_with_unit_sold"]
print(top_5_absolute_correlated_events_unit)
          event_name_1  absolute_correlation_with_unit_sold
0     event1_Christmas                             0.104071
1      event1_LaborDay                             0.033212
2  event1_Thanksgiving                             0.031481
3       event1_NewYear                             0.029564
4     event1_SuperBowl                             0.028783
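Pearson correlation between a 0/1 event dummy and a numeric series is the point-biserial correlation, so a handful of extreme days can dominate it; this is likely why Christmas (when stores close and sales collapse) tops the absolute ranking. A synthetic sketch (all numbers invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 365
sales = pd.Series(2000 + rng.normal(0, 100, n))
event = pd.Series(np.zeros(n, dtype=int))
event.iloc[358] = 1        # one flagged day in the year
sales.iloc[358] -= 1500    # sales collapse on that day

# Series.corr() against a 0/1 dummy is the point-biserial correlation
r = sales.corr(event)
print(round(r, 3))  # clearly negative: the event co-occurs with low sales
```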
Foods3_CA_corr = pd.concat([Foods3_CA_agg, event_dummies1], axis=1)
numeric_cols = ["revenue"] + list(event_dummies1.columns) # Only revenue and event dummies
correlation_matrix = Foods3_CA_corr[numeric_cols].corr()
event_correlations_revenue = correlation_matrix["revenue"].drop("revenue").sort_values(ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x=event_correlations_revenue.values, y=event_correlations_revenue.index,
            hue=event_correlations_revenue.index, palette="coolwarm", legend=False)
plt.xlabel("Correlation with Revenue")
plt.ylabel("Event Name")
plt.title("Correlation Between Events and Revenue")
plt.show()
top_5_absolute_correlated_events_revenue = event_correlations_revenue.abs().sort_values(ascending=False).head(5).reset_index()
top_5_absolute_correlated_events_revenue.columns = ["event_name_1", "absolute_correlation_revenue"]
print(top_5_absolute_correlated_events_revenue)
          event_name_1  absolute_correlation_revenue
0     event1_Christmas                      0.107807
1  event1_Thanksgiving                      0.036987
2       event1_NewYear                      0.030744
3     event1_SuperBowl                      0.030330
4      event1_LaborDay                      0.030169
top_events = ["Christmas", "Thanksgiving", "NewYear", "SuperBowl", "LaborDay"]
event_dummies = pd.get_dummies(Foods3_CA_agg["event_name_1"], prefix="event1", dtype=int)
event_dummies = event_dummies[[f"event1_{event}" for event in top_events if f"event1_{event}" in event_dummies.columns]]
Foods3_CA_dummies = pd.concat([Foods3_CA_agg, event_dummies], axis=1)
Foods3_CA_dummies.head()
| | date | d | store_id | unit_sold | sell_price | revenue | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-29 | d_1 | CA_1 | 2268 | 1.161917 | 4323.710449 | None | None | None | None | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2011-01-29 | d_1 | CA_2 | 1575 | 1.027044 | 3093.028076 | None | None | None | None | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2011-01-29 | d_1 | CA_3 | 2478 | 1.136871 | 4282.417969 | None | None | None | None | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2011-01-29 | d_1 | CA_4 | 759 | 1.047593 | 1566.507812 | None | None | None | None | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 10 | 2011-02-07 | d_10 | CA_1 | 1711 | 1.220468 | 3310.140137 | None | None | None | None | 1.0 | 0 | 0 | 0 | 0 | 0 |
Foods3_CA_df = Foods3_CA_dummies.drop(columns=["d", "event_name_1", "event_name_2", "event_type_1", "event_type_2"], errors="ignore")
Foods3_CA_df.head()
| | date | store_id | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-29 | CA_1 | 2268 | 1.161917 | 4323.710449 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2011-01-29 | CA_2 | 1575 | 1.027044 | 3093.028076 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2011-01-29 | CA_3 | 2478 | 1.136871 | 4282.417969 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2011-01-29 | CA_4 | 759 | 1.047593 | 1566.507812 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 10 | 2011-02-07 | CA_1 | 1711 | 1.220468 | 3310.140137 | 1.0 | 0 | 0 | 0 | 0 | 0 |
Foods3_CA_df.info()  # 7764 rows x 11 columns
Foods3_CA_df = Foods3_CA_df.sort_values(by="date")
Foods3_CA_df
| date | store_id | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-29 | CA_1 | 2268 | 1.161917 | 4323.710449 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2011-01-29 | CA_2 | 1575 | 1.027044 | 3093.028076 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2011-01-29 | CA_3 | 2478 | 1.136871 | 4282.417969 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2011-01-29 | CA_4 | 759 | 1.047593 | 1566.507812 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 10533 | 2011-01-30 | CA_4 | 798 | 1.047593 | 1600.170898 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10462 | 2016-05-21 | CA_3 | 3102 | 2.919633 | 7805.052246 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 10471 | 2016-05-22 | CA_2 | 2628 | 2.921383 | 6800.089844 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 10472 | 2016-05-22 | CA_3 | 3573 | 2.919633 | 8930.868164 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 10473 | 2016-05-22 | CA_4 | 1566 | 2.927146 | 4067.809082 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 10470 | 2016-05-22 | CA_1 | 3274 | 2.926042 | 8687.856445 | 0.0 | 0 | 0 | 0 | 0 | 0 |
7764 rows × 11 columns
Foods3_CA_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 7764 entries, 0 to 10470 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 7764 non-null datetime64[ns] 1 store_id 7764 non-null category 2 unit_sold 7764 non-null int16 3 sell_price 7764 non-null float32 4 revenue 7764 non-null float32 5 snap_CA 7764 non-null float64 6 event1_Christmas 7764 non-null int32 7 event1_Thanksgiving 7764 non-null int32 8 event1_NewYear 7764 non-null int32 9 event1_SuperBowl 7764 non-null int32 10 event1_LaborDay 7764 non-null int32 dtypes: category(1), datetime64[ns](1), float32(2), float64(1), int16(1), int32(5) memory usage: 417.4 KB
# Foods3_CA_df.to_csv("C:/Users/97610/Downloads/m5-forecasting-accuracy/time series data_2011-2016_4 stores revenue in Foods3 department with 5 major events.csv")
Foods3_CA_df = pd.read_csv('C:/Users/97610/Downloads/m5-forecasting-accuracy/time series data_2011-2016_4 stores revenue in Foods3 department with 5 major events.csv')
Foods3_CA_df.drop('Unnamed: 0', axis=1, inplace=True)
Foods3_CA_df.head()
| date | store_id | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-29 | CA_1 | 2268 | 1.161917 | 4323.7104 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2011-01-29 | CA_2 | 1575 | 1.027044 | 3093.0280 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2011-01-29 | CA_3 | 2478 | 1.136871 | 4282.4180 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2011-01-29 | CA_4 | 759 | 1.047593 | 1566.5078 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2011-01-30 | CA_4 | 798 | 1.047593 | 1600.1709 | 0.0 | 0 | 0 | 0 | 0 | 0 |
Foods3_CA_df.tail()
| date | store_id | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7759 | 2016-05-21 | CA_3 | 3102 | 2.919633 | 7805.0522 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 7760 | 2016-05-22 | CA_2 | 2628 | 2.921383 | 6800.0900 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 7761 | 2016-05-22 | CA_3 | 3573 | 2.919633 | 8930.8680 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 7762 | 2016-05-22 | CA_4 | 1566 | 2.927146 | 4067.8090 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 7763 | 2016-05-22 | CA_1 | 3274 | 2.926042 | 8687.8560 | 0.0 | 0 | 0 | 0 | 0 | 0 |
Since 2011 and 2016 contain incomplete months, we only use 2012 to 2015 for the project analysis¶
Foods3_CA_df["date"] = pd.to_datetime(Foods3_CA_df["date"], format="%Y-%m-%d")
Foods3_CA_df = Foods3_CA_df[
(Foods3_CA_df["date"] >= pd.Timestamp("2012-01-01")) &
(Foods3_CA_df["date"] <= pd.Timestamp("2015-12-31"))
]
Foods3_CA_df.head()
| date | store_id | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1348 | 2012-01-01 | CA_1 | 1448 | 1.645839 | 3213.3958 | 1.0 | 0 | 0 | 1 | 0 | 0 |
| 1349 | 2012-01-01 | CA_3 | 2051 | 1.630491 | 4246.9053 | 1.0 | 0 | 0 | 1 | 0 | 0 |
| 1350 | 2012-01-01 | CA_4 | 738 | 1.589748 | 1637.3416 | 1.0 | 0 | 0 | 1 | 0 | 0 |
| 1351 | 2012-01-01 | CA_2 | 941 | 1.473136 | 1994.1145 | 1.0 | 0 | 0 | 1 | 0 | 0 |
| 1352 | 2012-01-02 | CA_1 | 2001 | 1.645839 | 4684.9550 | 1.0 | 0 | 0 | 0 | 0 | 0 |
Foods3_CA_df.tail()
| date | store_id | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7187 | 2015-12-30 | CA_2 | 1625 | 2.923421 | 3926.1920 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 7188 | 2015-12-31 | CA_4 | 1072 | 2.924935 | 2814.8704 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 7189 | 2015-12-31 | CA_1 | 1940 | 2.921556 | 5333.0360 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 7190 | 2015-12-31 | CA_2 | 2087 | 2.923421 | 5141.0130 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 7191 | 2015-12-31 | CA_3 | 2682 | 2.875478 | 6914.9814 | 0.0 | 0 | 0 | 0 | 0 | 0 |
Foods3_CA_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 5844 entries, 1348 to 7191 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 5844 non-null datetime64[ns] 1 store_id 5844 non-null object 2 unit_sold 5844 non-null int64 3 sell_price 5844 non-null float64 4 revenue 5844 non-null float64 5 snap_CA 5844 non-null float64 6 event1_Christmas 5844 non-null int64 7 event1_Thanksgiving 5844 non-null int64 8 event1_NewYear 5844 non-null int64 9 event1_SuperBowl 5844 non-null int64 10 event1_LaborDay 5844 non-null int64 dtypes: datetime64[ns](1), float64(3), int64(6), object(1) memory usage: 547.9+ KB
fig = px.line(
Foods3_CA_df,
x="date",
y="revenue",
color="store_id",
title="Time Series of Revenue Over Time",
labels={"revenue": "Total Revenue", "date": "Date", "store_id": "Store ID"},
markers=True
)
fig.show()
fig = px.box(
Foods3_CA_df,
x="store_id",
y="revenue",
title="Revenue Distribution by Store",
labels={"revenue": "Total Revenue", "store_id": "Store ID"},
notched=True
)
fig.show()
# Remove unused columns
# Foods3_CA_df['store_id'] = Foods3_CA_df['store_id'].cat.remove_unused_categories()
# Calculate store revenue percentage
# 1. Sum revenue by store
store_revenue = Foods3_CA_df.groupby('store_id')['revenue'].sum().reset_index()
# 2. Calculate the overall total revenue
total_revenue = store_revenue['revenue'].sum()
# 3. Compute each store's revenue percentage
store_revenue['revenue_percentage'] = (store_revenue['revenue'] / total_revenue) * 100
store_revenue.style.format({'revenue': '{:,.2f}'})
| store_id | revenue | revenue_percentage | |
|---|---|---|---|
| 0 | CA_1 | 7,350,989.74 | 29.276478 |
| 1 | CA_2 | 4,219,153.53 | 16.803445 |
| 2 | CA_3 | 9,835,288.64 | 39.170590 |
| 3 | CA_4 | 3,703,428.06 | 14.749487 |
fig = px.bar(
store_revenue,
x='store_id',
y='revenue_percentage',
title='Revenue Percentage by Store',
    labels={'store_id': 'Store ID', 'revenue_percentage': 'Revenue Percentage'},
text='revenue_percentage', # Display the percentage above each bar
height=700, # Increase the height
)
# Optionally format the text to show fewer decimals
fig.update_traces(texttemplate='%{text:.2f}%', textposition='outside')
fig.show()
# Aggregate total revenue by date
total_revenue_by_date = Foods3_CA_df.groupby("date", as_index=False)["revenue"].sum()
fig = px.line(
total_revenue_by_date,
x="date",
y="revenue",
title="Time Series of Total Revenue Over Time",
labels={"revenue": "Total Revenue", "date": "Date"},
markers=True
)
fig.show()
total_revenue_by_date
| date | revenue | |
|---|---|---|
| 0 | 2012-01-01 | 11091.7572 |
| 1 | 2012-01-02 | 16695.3801 |
| 2 | 2012-01-03 | 13604.3844 |
| 3 | 2012-01-04 | 12437.8751 |
| 4 | 2012-01-05 | 14248.9240 |
| ... | ... | ... |
| 1456 | 2015-12-27 | 16427.9190 |
| 1457 | 2015-12-28 | 15311.9904 |
| 1458 | 2015-12-29 | 14675.1821 |
| 1459 | 2015-12-30 | 16201.4307 |
| 1460 | 2015-12-31 | 20203.9008 |
1461 rows × 2 columns
# total_revenue_by_date.to_csv("C:/Users/97610/Downloads/m5-forecasting-accuracy/time series data_2012-2015_Foods3 department revenue.csv")
Foods3_CA_df
| date | store_id | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1348 | 2012-01-01 | CA_1 | 1448 | 1.645839 | 3213.3958 | 1.0 | 0 | 0 | 1 | 0 | 0 |
| 1349 | 2012-01-01 | CA_3 | 2051 | 1.630491 | 4246.9053 | 1.0 | 0 | 0 | 1 | 0 | 0 |
| 1350 | 2012-01-01 | CA_4 | 738 | 1.589748 | 1637.3416 | 1.0 | 0 | 0 | 1 | 0 | 0 |
| 1351 | 2012-01-01 | CA_2 | 941 | 1.473136 | 1994.1145 | 1.0 | 0 | 0 | 1 | 0 | 0 |
| 1352 | 2012-01-02 | CA_1 | 2001 | 1.645839 | 4684.9550 | 1.0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7187 | 2015-12-30 | CA_2 | 1625 | 2.923421 | 3926.1920 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 7188 | 2015-12-31 | CA_4 | 1072 | 2.924935 | 2814.8704 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 7189 | 2015-12-31 | CA_1 | 1940 | 2.921556 | 5333.0360 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 7190 | 2015-12-31 | CA_2 | 2087 | 2.923421 | 5141.0130 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 7191 | 2015-12-31 | CA_3 | 2682 | 2.875478 | 6914.9814 | 0.0 | 0 | 0 | 0 | 0 | 0 |
5844 rows × 11 columns
# Foods3_CA_df.to_csv("C:/Users/97610/Downloads/m5-forecasting-accuracy/time series data_2012-2015_4 stores revenue in Foods3 department with 5 major events.csv")
Revenue Forecasting - Traditional Statistical Model (SARIMAX)¶
import pandas as pd
import numpy as np
import statsmodels.api as sm
import pmdarima as pm
# Load visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
%matplotlib inline
from statsmodels.tsa.seasonal import seasonal_decompose
# Load specific forecasting tools
from statsmodels.tsa.arima.model import ARIMA  # note: statsmodels.tsa.arima_model (ARMA/ARIMA) was removed in statsmodels 0.13
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf # for determining (p,q) orders
from pmdarima import auto_arima # for determining ARIMA orders
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics import mean_absolute_percentage_error
# Foods3_CA_df = pd.read_csv('C:/Users/97610/Downloads/m5-forecasting-accuracy/time series data_2012-2015_4 stores revenue in Foods3 department with 5 major events.csv')
Foods3_CA_df
| date | store_id | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1348 | 2012-01-01 | CA_1 | 1448 | 1.645839 | 3213.3958 | 1.0 | 0 | 0 | 1 | 0 | 0 |
| 1349 | 2012-01-01 | CA_3 | 2051 | 1.630491 | 4246.9053 | 1.0 | 0 | 0 | 1 | 0 | 0 |
| 1350 | 2012-01-01 | CA_4 | 738 | 1.589748 | 1637.3416 | 1.0 | 0 | 0 | 1 | 0 | 0 |
| 1351 | 2012-01-01 | CA_2 | 941 | 1.473136 | 1994.1145 | 1.0 | 0 | 0 | 1 | 0 | 0 |
| 1352 | 2012-01-02 | CA_1 | 2001 | 1.645839 | 4684.9550 | 1.0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7187 | 2015-12-30 | CA_2 | 1625 | 2.923421 | 3926.1920 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 7188 | 2015-12-31 | CA_4 | 1072 | 2.924935 | 2814.8704 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 7189 | 2015-12-31 | CA_1 | 1940 | 2.921556 | 5333.0360 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 7190 | 2015-12-31 | CA_2 | 2087 | 2.923421 | 5141.0130 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 7191 | 2015-12-31 | CA_3 | 2682 | 2.875478 | 6914.9814 | 0.0 | 0 | 0 | 0 | 0 | 0 |
5844 rows × 11 columns
Foods3_CA_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 5844 entries, 1348 to 7191 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 5844 non-null datetime64[ns] 1 store_id 5844 non-null object 2 unit_sold 5844 non-null int64 3 sell_price 5844 non-null float64 4 revenue 5844 non-null float64 5 snap_CA 5844 non-null float64 6 event1_Christmas 5844 non-null int64 7 event1_Thanksgiving 5844 non-null int64 8 event1_NewYear 5844 non-null int64 9 event1_SuperBowl 5844 non-null int64 10 event1_LaborDay 5844 non-null int64 dtypes: datetime64[ns](1), float64(3), int64(6), object(1) memory usage: 547.9+ KB
Data Preprocessing¶
# Convert 'date' to datetime format
Foods3_CA_df['date'] = pd.to_datetime(Foods3_CA_df['date'])
# Drop unnecessary columns (like the unnamed index column)
# Foods3_CA_df.drop(columns=['Unnamed: 0'], inplace=True)
# Check for missing values
missing_values = Foods3_CA_df.isnull().sum()
# Aggregate data if needed (checking for duplicate entries per store per date)
duplicate_check = Foods3_CA_df.duplicated(subset=['date', 'store_id']).sum()
# Display updated dataframe info
Foods3_CA_df.info(), missing_values, duplicate_check
<class 'pandas.core.frame.DataFrame'> Index: 5844 entries, 1348 to 7191 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 5844 non-null datetime64[ns] 1 store_id 5844 non-null object 2 unit_sold 5844 non-null int64 3 sell_price 5844 non-null float64 4 revenue 5844 non-null float64 5 snap_CA 5844 non-null float64 6 event1_Christmas 5844 non-null int64 7 event1_Thanksgiving 5844 non-null int64 8 event1_NewYear 5844 non-null int64 9 event1_SuperBowl 5844 non-null int64 10 event1_LaborDay 5844 non-null int64 dtypes: datetime64[ns](1), float64(3), int64(6), object(1) memory usage: 547.9+ KB
(None, date 0 store_id 0 unit_sold 0 sell_price 0 revenue 0 snap_CA 0 event1_Christmas 0 event1_Thanksgiving 0 event1_NewYear 0 event1_SuperBowl 0 event1_LaborDay 0 dtype: int64, 0)
1. Foods3 Department Revenue Forecasting¶
Feature Engineering & EDA¶
# Aggregate data: Summing revenue per date across all stores
department_daily = Foods3_CA_df.groupby('date').agg({
'revenue': 'sum', # Total revenue per day
'unit_sold': 'sum', # Total units sold per day
'sell_price': 'mean', # Average selling price
'snap_CA': 'mean', # Average SNAP benefit indicator
'event1_Christmas': 'max', # If any store had the event on a given day
'event1_Thanksgiving': 'max',
'event1_NewYear': 'max',
'event1_SuperBowl': 'max',
'event1_LaborDay': 'max'
}).reset_index()
# Display daily aggregated data
department_daily
| date | revenue | unit_sold | sell_price | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | 11091.7572 | 5178 | 1.584803 | 1.0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 2012-01-02 | 16695.3801 | 7517 | 1.584803 | 1.0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2012-01-03 | 13604.3844 | 6368 | 1.584803 | 1.0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2012-01-04 | 12437.8751 | 5748 | 1.584803 | 1.0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2012-01-05 | 14248.9240 | 6501 | 1.584803 | 1.0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1456 | 2015-12-27 | 16427.9190 | 6212 | 2.911347 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 1457 | 2015-12-28 | 15311.9904 | 5924 | 2.911347 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 1458 | 2015-12-29 | 14675.1821 | 5680 | 2.911347 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 1459 | 2015-12-30 | 16201.4307 | 6189 | 2.911347 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 1460 | 2015-12-31 | 20203.9008 | 7781 | 2.911347 | 0.0 | 0 | 0 | 0 | 0 | 0 |
1461 rows × 10 columns
# Plot daily revenue trend
plt.figure(figsize=(12,6))
plt.plot(department_daily['date'], department_daily['revenue'], label='Daily Revenue', linewidth=1)
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('Foods 3 Department Daily Revenue Trend Over Time')
plt.legend()
plt.grid(True)
plt.show()
# Aggregate data: create a "week" column and sum revenue per week across all stores
Foods3_CA_df['week'] = Foods3_CA_df['date'].dt.to_period('W')
department_weekly = Foods3_CA_df.groupby('week').agg({
'revenue': 'sum',
'unit_sold': 'sum',
'sell_price': 'mean',
'snap_CA': 'mean',
'event1_Christmas': 'max',
'event1_Thanksgiving': 'max',
'event1_NewYear': 'max',
'event1_SuperBowl': 'max',
'event1_LaborDay': 'max'
}).reset_index()
department_weekly = department_weekly.iloc[1:-1].reset_index(drop=True)  # drop the partial first and last weeks
# Display weekly aggregated data
department_weekly
| week | revenue | unit_sold | sell_price | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-01-02/2012-01-08 | 110698.152200 | 50362 | 1.588490 | 1.000000 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2012-01-09/2012-01-15 | 102482.814800 | 47556 | 1.598924 | 0.285714 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2012-01-16/2012-01-22 | 102863.599500 | 47503 | 1.604877 | 0.000000 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2012-01-23/2012-01-29 | 93079.113900 | 43992 | 1.624494 | 0.000000 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2012-01-30/2012-02-05 | 114189.451100 | 51928 | 1.657096 | 0.714286 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 203 | 2015-11-23/2015-11-29 | 111590.456300 | 45080 | 2.913383 | 0.000000 | 0 | 1 | 0 | 0 | 0 |
| 204 | 2015-11-30/2015-12-06 | 119007.117600 | 46183 | 2.912602 | 0.857143 | 0 | 0 | 0 | 0 | 0 |
| 205 | 2015-12-07/2015-12-13 | 115926.823100 | 45490 | 2.910630 | 0.571429 | 0 | 0 | 0 | 0 | 0 |
| 206 | 2015-12-14/2015-12-20 | 115783.140800 | 44612 | 2.910472 | 0.000000 | 0 | 0 | 0 | 0 | 0 |
| 207 | 2015-12-21/2015-12-27 | 99922.598308 | 38469 | 2.910264 | 0.000000 | 1 | 0 | 0 | 0 | 0 |
208 rows × 10 columns
# Convert the 'week' column from Period to Timestamp
# department_weekly['week'] = department_weekly['week'].dt.to_timestamp()
# Plot weekly revenue trend
plt.figure(figsize=(12,6))
plt.plot(department_weekly['week'], department_weekly['revenue'], label='Weekly Revenue', linewidth=1)
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('Foods 3 Department Weekly Revenue Trend Over Time')
plt.legend()
plt.show()
# Aggregate data: create a "month" column and sum revenue per month across all stores
Foods3_CA_df['month'] = Foods3_CA_df['date'].dt.to_period('M')
department_monthly = Foods3_CA_df.groupby('month').agg({
'revenue': 'sum',
'unit_sold': 'sum',
'sell_price': 'mean',
'snap_CA': 'mean',
'event1_Christmas': 'max',
'event1_Thanksgiving': 'max',
'event1_NewYear': 'max',
'event1_SuperBowl': 'max',
'event1_LaborDay': 'max'
}).reset_index()
# Display monthly aggregated data
department_monthly
| month | revenue | unit_sold | sell_price | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-01 | 446652.993300 | 206796 | 1.606872 | 0.322581 | 0 | 0 | 1 | 0 | 0 |
| 1 | 2012-02 | 457478.040800 | 205482 | 1.664356 | 0.344828 | 0 | 0 | 0 | 1 | 0 |
| 2 | 2012-03 | 527256.333700 | 234350 | 1.728166 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2012-04 | 515853.648100 | 232114 | 1.766734 | 0.333333 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2012-05 | 553696.991600 | 252253 | 1.798810 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 5 | 2012-06 | 550051.849800 | 255203 | 1.821972 | 0.333333 | 0 | 0 | 0 | 0 | 0 |
| 6 | 2012-07 | 533941.888900 | 247467 | 1.832073 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 7 | 2012-08 | 565252.727200 | 257153 | 1.841301 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 8 | 2012-09 | 514575.798200 | 233260 | 1.853635 | 0.333333 | 0 | 0 | 0 | 0 | 1 |
| 9 | 2012-10 | 463063.788800 | 216610 | 1.886128 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 10 | 2012-11 | 409666.099600 | 193544 | 1.908485 | 0.333333 | 0 | 1 | 0 | 0 | 0 |
| 11 | 2012-12 | 449432.537416 | 208390 | 1.939479 | 0.322581 | 1 | 0 | 0 | 0 | 0 |
| 12 | 2013-01 | 451909.413200 | 204266 | 1.977420 | 0.322581 | 0 | 0 | 1 | 0 | 0 |
| 13 | 2013-02 | 416719.174500 | 189610 | 2.021923 | 0.357143 | 0 | 0 | 0 | 1 | 0 |
| 14 | 2013-03 | 469418.195600 | 218046 | 2.063936 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 15 | 2013-04 | 469187.201400 | 218086 | 2.096399 | 0.333333 | 0 | 0 | 0 | 0 | 0 |
| 16 | 2013-05 | 483007.078600 | 226260 | 2.132688 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 17 | 2013-06 | 533400.292800 | 251893 | 2.164049 | 0.333333 | 0 | 0 | 0 | 0 | 0 |
| 18 | 2013-07 | 552472.551900 | 260784 | 2.181937 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 19 | 2013-08 | 565352.365700 | 274857 | 2.210481 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 20 | 2013-09 | 557284.346200 | 271261 | 2.276460 | 0.333333 | 0 | 0 | 0 | 0 | 1 |
| 21 | 2013-10 | 565761.394900 | 251945 | 2.335426 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 22 | 2013-11 | 495937.825500 | 227489 | 2.343453 | 0.333333 | 0 | 1 | 0 | 0 | 0 |
| 23 | 2013-12 | 484834.025153 | 223967 | 2.351611 | 0.322581 | 1 | 0 | 0 | 0 | 0 |
| 24 | 2014-01 | 502707.159500 | 231600 | 2.399634 | 0.322581 | 0 | 0 | 1 | 0 | 0 |
| 25 | 2014-02 | 469008.767100 | 204153 | 2.450571 | 0.357143 | 0 | 0 | 0 | 1 | 0 |
| 26 | 2014-03 | 567044.516300 | 243631 | 2.500773 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 27 | 2014-04 | 582852.901900 | 252596 | 2.531435 | 0.333333 | 0 | 0 | 0 | 0 | 0 |
| 28 | 2014-05 | 582902.376400 | 248245 | 2.554456 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 29 | 2014-06 | 560458.577300 | 253103 | 2.592864 | 0.333333 | 0 | 0 | 0 | 0 | 0 |
| 30 | 2014-07 | 587890.839800 | 264626 | 2.627328 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 31 | 2014-08 | 587159.544800 | 260661 | 2.650960 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 32 | 2014-09 | 552359.528300 | 242877 | 2.689410 | 0.333333 | 0 | 0 | 0 | 0 | 1 |
| 33 | 2014-10 | 571807.364100 | 230924 | 2.718435 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 34 | 2014-11 | 540479.641000 | 219940 | 2.747414 | 0.333333 | 0 | 1 | 0 | 0 | 0 |
| 35 | 2014-12 | 491734.536911 | 193572 | 2.756147 | 0.322581 | 1 | 0 | 0 | 0 | 0 |
| 36 | 2015-01 | 525618.729000 | 215120 | 2.792603 | 0.322581 | 0 | 0 | 1 | 0 | 0 |
| 37 | 2015-02 | 469823.563900 | 189967 | 2.831282 | 0.357143 | 0 | 0 | 0 | 1 | 0 |
| 38 | 2015-03 | 536504.483700 | 216395 | 2.858263 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 39 | 2015-04 | 515726.202600 | 204760 | 2.870039 | 0.333333 | 0 | 0 | 0 | 0 | 0 |
| 40 | 2015-05 | 533872.967000 | 212946 | 2.879448 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 41 | 2015-06 | 535634.996200 | 232264 | 2.889860 | 0.333333 | 0 | 0 | 0 | 0 | 0 |
| 42 | 2015-07 | 578047.229000 | 245723 | 2.890509 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 43 | 2015-08 | 592465.328800 | 250992 | 2.895919 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 44 | 2015-09 | 568701.797800 | 239233 | 2.904005 | 0.333333 | 0 | 0 | 0 | 0 | 1 |
| 45 | 2015-10 | 603689.184300 | 245515 | 2.913604 | 0.322581 | 0 | 0 | 0 | 0 | 0 |
| 46 | 2015-11 | 518822.576500 | 206328 | 2.913687 | 0.333333 | 0 | 1 | 0 | 0 | 0 |
| 47 | 2015-12 | 501340.598508 | 194361 | 2.910960 | 0.322581 | 1 | 0 | 0 | 0 | 0 |
# Convert the 'month' column from Period to Timestamp
department_monthly['month'] = department_monthly['month'].dt.to_timestamp()
plt.figure(figsize=(12,6))
plt.plot(department_monthly['month'], department_monthly['revenue'], label='Monthly Revenue', linewidth=1)
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('Foods 3 Department Monthly Revenue Trend Over Time')
plt.legend()
plt.show()
# Aggregate data: create a "year" column and sum revenue per year across all stores
Foods3_CA_df['year'] = Foods3_CA_df['date'].dt.to_period('Y')
department_yearly = Foods3_CA_df.groupby('year').agg({
'revenue': 'sum',
'unit_sold': 'sum',
'sell_price': 'mean',
'snap_CA': 'mean',
'event1_Christmas': 'max',
'event1_Thanksgiving': 'max',
'event1_NewYear': 'max',
'event1_SuperBowl': 'max',
'event1_LaborDay': 'max'
}).reset_index()
# Display yearly aggregated data
department_yearly
| year | revenue | unit_sold | sell_price | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012 | 5.986923e+06 | 2742622 | 1.804395 | 0.327869 | 1 | 1 | 1 | 1 | 1 |
| 1 | 2013 | 6.045284e+06 | 2818464 | 2.180502 | 0.328767 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2014 | 6.596406e+06 | 2845928 | 2.602437 | 0.328767 | 1 | 1 | 1 | 1 | 1 |
| 3 | 2015 | 6.480248e+06 | 2653604 | 2.879409 | 0.328767 | 1 | 1 | 1 | 1 | 1 |
# Convert the 'year' column from Period to Timestamp
department_yearly['year'] = department_yearly['year'].dt.to_timestamp()
plt.figure(figsize=(12,6))
plt.plot(department_yearly['year'], department_yearly['revenue'], label='Yearly Revenue', linewidth=1)
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('Foods 3 Department Yearly Revenue Trend Over Time')
plt.legend()
plt.show()
Foods3_CA_df
| date | store_id | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | week | month | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1348 | 2012-01-01 | CA_1 | 1448 | 1.645839 | 3213.3958 | 1.0 | 0 | 0 | 1 | 0 | 0 | 2011-12-26/2012-01-01 | 2012-01 | 2012 |
| 1349 | 2012-01-01 | CA_3 | 2051 | 1.630491 | 4246.9053 | 1.0 | 0 | 0 | 1 | 0 | 0 | 2011-12-26/2012-01-01 | 2012-01 | 2012 |
| 1350 | 2012-01-01 | CA_4 | 738 | 1.589748 | 1637.3416 | 1.0 | 0 | 0 | 1 | 0 | 0 | 2011-12-26/2012-01-01 | 2012-01 | 2012 |
| 1351 | 2012-01-01 | CA_2 | 941 | 1.473136 | 1994.1145 | 1.0 | 0 | 0 | 1 | 0 | 0 | 2011-12-26/2012-01-01 | 2012-01 | 2012 |
| 1352 | 2012-01-02 | CA_1 | 2001 | 1.645839 | 4684.9550 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7187 | 2015-12-30 | CA_2 | 1625 | 2.923421 | 3926.1920 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
| 7188 | 2015-12-31 | CA_4 | 1072 | 2.924935 | 2814.8704 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
| 7189 | 2015-12-31 | CA_1 | 1940 | 2.921556 | 5333.0360 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
| 7190 | 2015-12-31 | CA_2 | 2087 | 2.923421 | 5141.0130 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
| 7191 | 2015-12-31 | CA_3 | 2682 | 2.875478 | 6914.9814 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
5844 rows × 14 columns
Proceed With Department Daily Data (Since Events Occur on Specific Days)¶
department_daily.describe()
| date | revenue | unit_sold | sell_price | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1461 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 |
| mean | 2013-12-31 00:00:00 | 17186.078011 | 7570.580424 | 2.366301 | 0.328542 | 0.002738 | 0.002738 | 0.002738 | 0.002738 | 0.002738 |
| min | 2012-01-01 00:00:00 | 8.299316 | 5.000000 | 1.584803 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2012-12-31 00:00:00 | 14804.731200 | 6466.000000 | 1.950130 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 2013-12-31 00:00:00 | 16695.380100 | 7332.000000 | 2.368837 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 2014-12-31 00:00:00 | 19387.146800 | 8571.000000 | 2.763858 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 2015-12-31 00:00:00 | 27796.757300 | 12852.000000 | 2.915013 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| std | NaN | 3457.469523 | 1538.943798 | 0.420759 | 0.469844 | 0.052271 | 0.052271 | 0.052271 | 0.052271 | 0.052271 |
department_daily.shape
(1461, 10)
# Plot histograms for each feature (Ignore first plot - date column)
department_daily.iloc[:, 1:].hist(bins=30, figsize=(12, 10))
plt.tight_layout()
plt.show()
# Select only numeric columns from the DataFrame
numeric_df = department_daily.select_dtypes(include=['number'])
# Correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.show()
- `unit_sold` and `revenue` are highly positively correlated, indicating that more units sold directly drives higher revenue
- `sell_price` and `revenue` also have a moderately positive correlation, implying that higher prices also contribute to higher revenue
- `sell_price` and `unit_sold` have a slightly positive correlation, which is counter-intuitive, since higher prices should lead to lower sales
- `event1_Christmas`, `event1_Thanksgiving`, and `event1_NewYear` are negatively correlated with `revenue`, meaning people stay at home and buy less on those event days; it could also reflect store closures (especially for Christmas, which has the most negative impact on revenue)
- `event1_SuperBowl` and `event1_LaborDay` are positively correlated with `revenue`, meaning people buy more on those days, though the effect is very slight
- `snap_CA` contributes slightly to `revenue` and `unit_sold`; interestingly, SNAP usage is lower on Christmas and Thanksgiving but higher on New Year's Day (possibly because stores close on Christmas and Thanksgiving but stay open on New Year's)
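The event effects read off the heatmap can also be spot-checked directly by comparing average revenue on event days versus ordinary days. A minimal sketch on a synthetic stand-in for `department_daily` (column names match the notebook; the values are made up for illustration):

```python
import pandas as pd

# Synthetic stand-in for department_daily: 10 ordinary days and 2 Christmas days
# (illustrative values only, not from the real dataset)
toy = pd.DataFrame({
    "revenue": [100, 110, 105, 95, 102, 98, 103, 107, 99, 101, 20, 25],
    "event1_Christmas": [0] * 10 + [1, 1],
})

# Mean revenue with and without the event flag set
by_event = toy.groupby("event1_Christmas")["revenue"].mean()
lift = by_event.loc[1] - by_event.loc[0]
print(by_event)
print(f"Christmas effect on mean daily revenue: {lift:+.1f}")
```

A strongly negative lift here would mirror the negative Christmas correlation seen in the heatmap; the same groupby pattern works for any of the event columns.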
# Check for seasonality using a rolling mean
department_daily['revenue_rolling_mean'] = department_daily['revenue'].rolling(window=30, min_periods=1).mean()
# Plot rolling mean trend
plt.figure(figsize=(12,6))
plt.plot(department_daily['date'], department_daily['revenue'], label='Daily Revenue', alpha=0.5)
plt.plot(department_daily['date'], department_daily['revenue_rolling_mean'], label='30-Day Rolling Mean', color='red')
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('Foods 3 Department Daily Revenue Trend with 30-Day Rolling Mean')
plt.legend()
plt.grid(True)
plt.show()
The plot indicates strong seasonality with predictable peaks around key events and a slightly rising baseline over the years. The 30-day rolling mean is especially helpful for seeing the bigger picture behind the daily volatility¶
# Perform time series decomposition
decomposition = seasonal_decompose(department_daily.set_index('date')['revenue'], model='additive', period=365)
# Plot decomposition components
plt.figure(figsize=(12,8))
plt.subplot(4, 1, 1)
plt.plot(decomposition.observed, label='Observed', color='black')
plt.legend(loc='upper left')
plt.subplot(4, 1, 2)
plt.plot(decomposition.trend, label='Trend', color='blue')
plt.legend(loc='upper left')
plt.subplot(4, 1, 3)
plt.plot(decomposition.seasonal, label='Seasonal', color='green')
plt.legend(loc='upper left')
plt.subplot(4, 1, 4)
plt.plot(decomposition.resid, label='Residual', color='red')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
Both a gradual, long-term trend and recurring seasonal patterns play significant roles in this revenue time series, with the remaining “residual” fluctuations being relatively random day-to-day noise¶
# Extract seasonal component
seasonality_strength = np.var(decomposition.seasonal) / np.var(department_daily['revenue'])
# Display the seasonality strength
seasonality_strength
0.5453845443618595
Roughly 55% of the variance in the department revenue series is explained by seasonality, indicating a strong degree of seasonality¶
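The strength metric above is just a variance ratio. A self-contained sketch on a synthetic series (a yearly sine plus white noise, not the Walmart data) shows how it behaves when the true components are known:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4 * 365  # four full "years" of daily observations
t = np.arange(n)

# Synthetic daily series: yearly sine seasonality plus white noise
seasonal = 10 * np.sin(2 * np.pi * t / 365)   # variance 50
noise = rng.normal(0, 5, n)                   # variance ~25
series = 100 + seasonal + noise

# Seasonality strength = variance of the seasonal component
# divided by variance of the full series
strength = np.var(seasonal) / np.var(series)
print(f"seasonality strength: {strength:.2f}")
```

With these parameters the ratio lands near 50 / (50 + 25) ≈ 0.67, so a value of ~0.55 on the revenue series is consistent with seasonality dominating the noise.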
# Autocorrelation and Partial Autocorrelation plots
plt.figure(figsize=(12,4))
plot_acf(department_daily['revenue'], lags=50, zero=False, ax=plt.gca()) # Autocorrelation from lag 1
plt.title('Autocorrelation of Revenue (starting from lag 1)')
plt.show()
plt.figure(figsize=(12,4))
plot_pacf(department_daily['revenue'], lags=50, zero=False, ax=plt.gca()) # Partial Autocorrelation from lag 1
plt.title('Partial Autocorrelation of Revenue (starting from lag 1)')
plt.show()
The PACF suggests the AR order p: the sharp cutoff after lag 1 points to p = 1
The ACF suggests the MA order q: the pattern is less clear-cut, but q = 1 is a reasonable starting guess
Begin Modeling¶
Apply ADF test to check stationarity¶
# Automate Augmented Dickey-Fuller (ADF) test for stationarity
from statsmodels.tsa.stattools import adfuller  # not in the imports at the top

def adf_test(series, title=''):
    """
    Pass in a time series and an optional title; prints an ADF report.
    """
    print(f'Augmented Dickey-Fuller Test: {title}')
    result = adfuller(series.dropna(), autolag='AIC') # .dropna() handles differenced data
labels = ['ADF test statistic', 'p-value', '# lags used', '# observations']
out = pd.Series(result[0:4], index=labels)
for key, val in result[4].items():
out[f'critical value ({key})'] = val
print(out.to_string()) # .to_string() removes the line "dtype: float64"
if result[1] <= 0.05:
print("Strong evidence against the null hypothesis")
print("Reject the null hypothesis")
print("Data has no unit root and is stationary")
else:
print("Weak evidence against the null hypothesis")
print("Fail to reject the null hypothesis")
print("Data has a unit root and is non-stationary")
adf_test(department_daily['revenue'])
Augmented Dickey-Fuller Test: 
ADF test statistic       -3.314950
p-value                   0.014223
# lags used              24.000000
# observations         1436.000000
critical value (1%)      -3.434912
critical value (5%)      -2.863555
critical value (10%)     -2.567843
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
Since the series is already stationary, we don't need to difference it; the differencing steps below are kept commented out for reference
# # Apply first-order differencing
# department_daily['revenue_diff'] = department_daily['revenue'].diff()
# # Perform ADF test again on the differenced series
# adf_test(department_daily['revenue_diff'])
# # Plot the differenced revenue
# plt.figure(figsize=(12,6))
# plt.plot(department_daily['date'], department_daily['revenue_diff'], label='Differenced Revenue', color='purple')
# plt.xlabel('Date')
# plt.ylabel('Differenced Revenue')
# plt.title('First-Order Differencing of Revenue to Achieve Stationarity')
# plt.legend()
# plt.grid(True)
# plt.show()
Given the strong seasonality (≈55% of the variance) and the five exogenous event variables, we proceed with SARIMAX:¶
department_daily
| | date | revenue | unit_sold | sell_price | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | revenue_rolling_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | 11091.7572 | 5178 | 1.584803 | 1.0 | 0 | 0 | 1 | 0 | 0 | 11091.757200 |
| 1 | 2012-01-02 | 16695.3801 | 7517 | 1.584803 | 1.0 | 0 | 0 | 0 | 0 | 0 | 13893.568650 |
| 2 | 2012-01-03 | 13604.3844 | 6368 | 1.584803 | 1.0 | 0 | 0 | 0 | 0 | 0 | 13797.173900 |
| 3 | 2012-01-04 | 12437.8751 | 5748 | 1.584803 | 1.0 | 0 | 0 | 0 | 0 | 0 | 13457.349200 |
| 4 | 2012-01-05 | 14248.9240 | 6501 | 1.584803 | 1.0 | 0 | 0 | 0 | 0 | 0 | 13615.664160 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1456 | 2015-12-27 | 16427.9190 | 6212 | 2.911347 | 0.0 | 0 | 0 | 0 | 0 | 0 | 16137.524387 |
| 1457 | 2015-12-28 | 15311.9904 | 5924 | 2.911347 | 0.0 | 0 | 0 | 0 | 0 | 0 | 16148.145014 |
| 1458 | 2015-12-29 | 14675.1821 | 5680 | 2.911347 | 0.0 | 0 | 0 | 0 | 0 | 0 | 16020.895077 |
| 1459 | 2015-12-30 | 16201.4307 | 6189 | 2.911347 | 0.0 | 0 | 0 | 0 | 0 | 0 | 16037.889924 |
| 1460 | 2015-12-31 | 20203.9008 | 7781 | 2.911347 | 0.0 | 0 | 0 | 0 | 0 | 0 | 16207.451810 |
1461 rows × 11 columns
department_daily.isnull().sum()
date                    0
revenue                 0
unit_sold               0
sell_price              0
snap_CA                 0
event1_Christmas        0
event1_Thanksgiving     0
event1_NewYear          0
event1_SuperBowl        0
event1_LaborDay         0
revenue_rolling_mean    0
dtype: int64
# When computing a differenced column, the first value is NaN because there is
# no previous value to subtract from. If differencing were needed, fill it with 0.
# (Assign the result back instead of calling .fillna(..., inplace=True) on a
# chained selection, which triggers a FutureWarning in recent pandas.)
# department_daily["revenue_diff"] = department_daily["revenue_diff"].fillna(0)
# # Define the endogenous (target) variable and exogenous variables
# y = department_daily.set_index('date')['revenue']
# exo = department_daily.set_index('date')[['event1_Christmas', 'event1_Thanksgiving', 'event1_NewYear', 'event1_SuperBowl', 'event1_LaborDay']]
# # Use Auto ARIMA to determine optimal SARIMAX parameters
# auto_arima_model = auto_arima(
# y,
# exogenous=exo,
# seasonal=True,
# m=365, # Assuming daily seasonality
# trace=True,
# error_action='ignore',
# suppress_warnings=True,
# stepwise=True,
# max_p=2, # maximum non-seasonal AR terms
# max_q=3, # maximum non-seasonal MA terms
# max_P=2, # maximum seasonal AR terms
# max_Q=3 # maximum seasonal MA terms
# )
# # Extract the best parameters
# optimal_order = auto_arima_model.order
# optimal_seasonal_order = auto_arima_model.seasonal_order
# optimal_order, optimal_seasonal_order
Because annual seasonality (m=365) makes auto ARIMA run essentially forever, we use weekly seasonality (m=7) instead:¶
# Define the endogenous (target) variable and exogenous variables
y = department_daily.set_index('date')['revenue']
exo = department_daily.set_index('date')[['event1_Christmas', 'event1_Thanksgiving', 'event1_NewYear', 'event1_SuperBowl', 'event1_LaborDay']]
# Use Auto ARIMA to determine optimal SARIMAX parameters
auto_arima_model = auto_arima(  # imported above via `from pmdarima import auto_arima`
    y,
    exogenous=exo,        # note: this keyword is renamed to `X` in pmdarima >= 2.0
    stationary=True,      # tell auto_arima the series is already stationary
    seasonal=True,
    m=7,                  # weekly seasonality
    trace=True,
    error_action='ignore',
    suppress_warnings=True,
    stepwise=True
)
# Extract the best parameters
optimal_order = auto_arima_model.order
optimal_seasonal_order = auto_arima_model.seasonal_order
optimal_order, optimal_seasonal_order
Performing stepwise search to minimize aic
 ARIMA(2,0,2)(1,0,1)[7] intercept   : AIC=26357.188, Time=4.05 sec
 ARIMA(0,0,0)(0,0,0)[7] intercept   : AIC=27958.448, Time=0.03 sec
 ARIMA(1,0,0)(1,0,0)[7] intercept   : AIC=26683.636, Time=1.37 sec
 ARIMA(0,0,1)(0,0,1)[7] intercept   : AIC=26872.441, Time=0.76 sec
 ARIMA(0,0,0)(0,0,0)[7]             : AIC=32700.985, Time=0.02 sec
 ARIMA(2,0,2)(0,0,1)[7] intercept   : AIC=26817.098, Time=1.68 sec
 ARIMA(2,0,2)(1,0,0)[7] intercept   : AIC=26473.888, Time=2.71 sec
 ARIMA(2,0,2)(2,0,1)[7] intercept   : AIC=inf, Time=4.48 sec
 ARIMA(2,0,2)(1,0,2)[7] intercept   : AIC=inf, Time=5.44 sec
 ARIMA(2,0,2)(0,0,0)[7] intercept   : AIC=27219.512, Time=0.94 sec
 ARIMA(2,0,2)(0,0,2)[7] intercept   : AIC=26635.132, Time=2.41 sec
 ARIMA(2,0,2)(2,0,0)[7] intercept   : AIC=26366.907, Time=5.39 sec
 ARIMA(2,0,2)(2,0,2)[7] intercept   : AIC=inf, Time=6.02 sec
 ARIMA(1,0,2)(1,0,1)[7] intercept   : AIC=inf, Time=3.33 sec
 ARIMA(2,0,1)(1,0,1)[7] intercept   : AIC=inf, Time=1.51 sec
 ARIMA(3,0,2)(1,0,1)[7] intercept   : AIC=inf, Time=3.66 sec
 ARIMA(2,0,3)(1,0,1)[7] intercept   : AIC=26181.941, Time=3.63 sec
 ARIMA(2,0,3)(0,0,1)[7] intercept   : AIC=26856.728, Time=2.07 sec
 ARIMA(2,0,3)(1,0,0)[7] intercept   : AIC=26373.643, Time=2.34 sec
 ARIMA(2,0,3)(2,0,1)[7] intercept   : AIC=inf, Time=5.54 sec
 ARIMA(2,0,3)(1,0,2)[7] intercept   : AIC=inf, Time=6.28 sec
 ARIMA(2,0,3)(0,0,0)[7] intercept   : AIC=27329.799, Time=1.26 sec
 ARIMA(2,0,3)(0,0,2)[7] intercept   : AIC=26632.885, Time=3.52 sec
 ARIMA(2,0,3)(2,0,0)[7] intercept   : AIC=26232.591, Time=4.99 sec
 ARIMA(2,0,3)(2,0,2)[7] intercept   : AIC=26161.978, Time=6.82 sec
 ARIMA(1,0,3)(2,0,2)[7] intercept   : AIC=inf, Time=6.04 sec
 ARIMA(3,0,3)(2,0,2)[7] intercept   : AIC=inf, Time=9.38 sec
 ARIMA(2,0,4)(2,0,2)[7] intercept   : AIC=inf, Time=9.49 sec
 ARIMA(1,0,2)(2,0,2)[7] intercept   : AIC=inf, Time=6.60 sec
 ARIMA(1,0,4)(2,0,2)[7] intercept   : AIC=inf, Time=8.22 sec
 ARIMA(3,0,2)(2,0,2)[7] intercept   : AIC=inf, Time=6.03 sec
 ARIMA(3,0,4)(2,0,2)[7] intercept   : AIC=inf, Time=10.39 sec
 ARIMA(2,0,3)(2,0,2)[7]             : AIC=inf, Time=6.51 sec

Best model:  ARIMA(2,0,3)(2,0,2)[7] intercept
Total fit time: 142.937 seconds
((2, 0, 3), (2, 0, 2, 7))
# Applying AUTO_ARIMA model to get p and q automatically
auto_arima_model.summary()
| Dep. Variable: | y | No. Observations: | 1461 |
|---|---|---|---|
| Model: | SARIMAX(2, 0, 3)x(2, 0, [1, 2], 7) | Log Likelihood | -13069.989 |
| Date: | Wed, 12 Mar 2025 | AIC | 26161.978 |
| Time: | 19:25:48 | BIC | 26220.133 |
| Sample: | 01-01-2012 | HQIC | 26183.671 |
| - 12-31-2015 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 7842.3152 | 7732.518 | 1.014 | 0.310 | -7313.141 | 2.3e+04 |
| ar.L1 | -1.6668 | 0.052 | -32.317 | 0.000 | -1.768 | -1.566 |
| ar.L2 | -0.9272 | 0.051 | -18.150 | 0.000 | -1.027 | -0.827 |
| ma.L1 | 2.0575 | 0.058 | 35.404 | 0.000 | 1.944 | 2.171 |
| ma.L2 | 1.5678 | 0.089 | 17.668 | 0.000 | 1.394 | 1.742 |
| ma.L3 | 0.3439 | 0.039 | 8.807 | 0.000 | 0.267 | 0.420 |
| ar.S.L7 | 0.2302 | 1.583 | 0.145 | 0.884 | -2.873 | 3.333 |
| ar.S.L14 | 0.6439 | 1.462 | 0.440 | 0.660 | -2.221 | 3.509 |
| ma.S.L7 | 0.1127 | 1.573 | 0.072 | 0.943 | -2.971 | 3.196 |
| ma.S.L14 | -0.4179 | 0.910 | -0.459 | 0.646 | -2.201 | 1.365 |
| sigma2 | 4.262e+06 | 5.207 | 8.19e+05 | 0.000 | 4.26e+06 | 4.26e+06 |
| Ljung-Box (L1) (Q): | 6.70 | Jarque-Bera (JB): | 7617.82 |
|---|---|---|---|
| Prob(Q): | 0.01 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 1.22 | Skew: | -1.00 |
| Prob(H) (two-sided): | 0.03 | Kurtosis: | 14.01 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 8.7e+21. Standard errors may be unstable.
# Fit the SARIMAX model
sarimax_model = SARIMAX(y,
exog=exo,
order=optimal_order,
seasonal_order=optimal_seasonal_order,
enforce_stationarity=False,
enforce_invertibility=False)
sarimax_results = sarimax_model.fit()
# Display model summary
sarimax_results.summary()
| Dep. Variable: | revenue | No. Observations: | 1461 |
|---|---|---|---|
| Model: | SARIMAX(2, 0, 3)x(2, 0, [1, 2], 7) | Log Likelihood | -13353.446 |
| Date: | Wed, 12 Mar 2025 | AIC | 26736.892 |
| Time: | 19:26:12 | BIC | 26816.009 |
| Sample: | 01-01-2012 | HQIC | 26766.423 |
| - 12-31-2015 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| event1_Christmas | 10.3751 | 336.898 | 0.031 | 0.975 | -649.932 | 670.682 |
| event1_Thanksgiving | 1.115e+04 | 340.184 | 32.785 | 0.000 | 1.05e+04 | 1.18e+04 |
| event1_NewYear | 1.131e+04 | 573.941 | 19.702 | 0.000 | 1.02e+04 | 1.24e+04 |
| event1_SuperBowl | 2.133e+04 | 390.929 | 54.554 | 0.000 | 2.06e+04 | 2.21e+04 |
| event1_LaborDay | 2.245e+04 | 275.143 | 81.594 | 0.000 | 2.19e+04 | 2.3e+04 |
| ar.L1 | 1.4789 | 0.367 | 4.033 | 0.000 | 0.760 | 2.198 |
| ar.L2 | -0.4870 | 0.359 | -1.355 | 0.175 | -1.191 | 0.217 |
| ma.L1 | -1.2754 | 0.368 | -3.466 | 0.001 | -1.997 | -0.554 |
| ma.L2 | 0.3446 | 0.287 | 1.200 | 0.230 | -0.218 | 0.907 |
| ma.L3 | -0.0221 | 0.053 | -0.415 | 0.678 | -0.126 | 0.082 |
| ar.S.L7 | 0.8671 | 0.259 | 3.348 | 0.001 | 0.360 | 1.375 |
| ar.S.L14 | 0.1335 | 0.259 | 0.515 | 0.606 | -0.374 | 0.641 |
| ma.S.L7 | -0.7419 | 0.263 | -2.824 | 0.005 | -1.257 | -0.227 |
| ma.S.L14 | -0.2315 | 0.251 | -0.923 | 0.356 | -0.723 | 0.260 |
| sigma2 | 8.145e+06 | 0.970 | 8.4e+06 | 0.000 | 8.14e+06 | 8.14e+06 |
| Ljung-Box (L1) (Q): | 0.01 | Jarque-Bera (JB): | 41086.87 |
|---|---|---|---|
| Prob(Q): | 0.94 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 1.02 | Skew: | -3.58 |
| Prob(H) (two-sided): | 0.80 | Kurtosis: | 28.14 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 7.9e+20. Standard errors may be unstable.
# Plot diagnostic plots (residuals, etc.)
sarimax_results.plot_diagnostics(figsize=(15, 12))
plt.show()
Residual Randomness: The standardized residuals seem fairly random and centered around zero
Normality: The histogram and Q-Q plot suggest residuals are not drastically non-normal. We might see slight deviations in the tails (common in real-world data)
No Major Autocorrelation: The correlogram shows no large spikes outside the confidence intervals, indicating the model has likely captured the time dependencies
These diagnostics collectively suggest the model is doing a reasonably good job¶
plt.figure(figsize=(12, 6))
plt.plot(y.index, y, label='Actual', marker='o')
plt.plot(y.index, sarimax_results.fittedvalues, label='Fitted', marker='x', color='red')
plt.title('Actual vs Fitted Values')
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.legend()
plt.show()
# Actual values (endogenous variable)
y_true = y # or test set if we're evaluating out-of-sample
# In-sample fitted values from SARIMAX
y_pred = sarimax_results.fittedvalues
# Calculate MAPE
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f'MAPE: {mape:.2f}%')
MAPE: 404.34%
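A MAPE above 100% here does not necessarily mean the fit is terrible: days with zero or near-zero revenue (e.g. Christmas closures in this dataset) make the |actual − forecast| / actual term explode. One common workaround, sketched below with a hypothetical `safe_mape` helper and an assumed threshold, is to mask out near-zero actuals before averaging:

```python
import numpy as np

def safe_mape(actual, forecast, min_actual=1.0):
    """MAPE computed only on rows where |actual| >= min_actual.

    Days with (near-)zero revenue make |actual - forecast| / actual blow up,
    so they are masked out. The min_actual threshold is an assumption, not
    something fixed by the notebook.
    """
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    mask = np.abs(actual) >= min_actual
    return np.mean(np.abs((actual[mask] - forecast[mask]) / actual[mask])) * 100

# Toy example: the single zero-revenue day would otherwise dominate the average
actual = np.array([100.0, 200.0, 0.0, 150.0])
forecast = np.array([110.0, 190.0, 50.0, 150.0])
print(f"{safe_mape(actual, forecast):.2f}%")  # -> 5.00%
```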
Model Evaluation¶
(1) Train-Test Validation¶
# Split data into train and test sets (80-20 split)
train_size = int(len(y) * 0.8)
train_y, test_y = y.iloc[:train_size], y.iloc[train_size:]
train_exo, test_exo = exo.iloc[:train_size, :], exo.iloc[train_size:, :]
# Refit SARIMAX model on training data
sarimax_model_train = SARIMAX(train_y,
exog=train_exo,
order=optimal_order,
seasonal_order=optimal_seasonal_order,
enforce_stationarity=False,
enforce_invertibility=False)
sarimax_results_train = sarimax_model_train.fit()
# Forecast on test set
Foods3_predictions = sarimax_results_train.forecast(steps=len(test_y), exog=test_exo)
(2) Check Distribution of MAPE¶
# Extract the actual and forecast values from the DataFrame
actual_values = test_y.values
forecast_values = Foods3_predictions
# Ensure both are one-dimensional arrays
actual_values = np.squeeze(actual_values)
forecast_values = np.squeeze(forecast_values)
# Calculate the MAPE for each row
mape_row_by_row = np.abs((actual_values - forecast_values) / actual_values) * 100
# Create a DataFrame to display the row-by-row MAPE along with the actual and forecast values
df_results = pd.DataFrame({
'Actual': actual_values,
'Forecast': forecast_values,
'MAPE': mape_row_by_row
})
# Calculate the overall weighted MAPE
weights = actual_values / actual_values.sum()
overall_mape = np.sum(mape_row_by_row * weights)
# Display the DataFrame
print(df_results)
# Print the overall MAPE
print(f"Overall MAPE: {overall_mape:.2f}%")
                Actual      Forecast       MAPE
2015-03-14  20577.9210  20980.513741   1.956431
2015-03-15  24262.6775  21910.661523   9.693967
2015-03-16  16995.1786  17126.601868   0.773297
2015-03-17  14918.4433  15536.951373   4.145929
2015-03-18  13953.3844  15063.428980   7.955379
...                ...           ...        ...
2015-12-27  16427.9190  22474.081068  36.804187
2015-12-28  15311.9904  17717.676021  15.711123
2015-12-29  14675.1821  16146.088998  10.023091
2015-12-30  16201.4307  15724.033126   2.946638
2015-12-31  20203.9008  14867.015155  26.415125

[293 rows x 3 columns]
Overall MAPE: 8.66%
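Because the weights are proportional to the actuals, this "weighted MAPE" is algebraically identical to WAPE (weighted absolute percentage error), Σ|actual − forecast| / Σactual × 100, which is why it is far less sensitive to small-revenue days than the plain MAPE. A small sketch with toy numbers verifying the identity:

```python
import numpy as np

actual = np.array([100.0, 1.0, 200.0])    # the tiny actual inflates plain MAPE
forecast = np.array([90.0, 3.0, 210.0])

# Row-by-row MAPE weighted by each row's share of total actuals
row_mape = np.abs((actual - forecast) / actual) * 100
weights = actual / actual.sum()
weighted_mape = np.sum(row_mape * weights)

# The same quantity computed directly as WAPE
wape = np.abs(actual - forecast).sum() / actual.sum() * 100

print(weighted_mape, wape)  # the two values are equal
```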
(3) Check Overestimate/Underestimate Forecasting & Predicted/Actual Pattern Match¶
# Plot predicted results 1
plt.figure(figsize=(12, 6))
plt.plot(test_y.index, test_y, label="Actual Revenue", color='blue', linewidth=2)
plt.plot(test_y.index, Foods3_predictions, label="Predicted Revenue", color='red', linestyle="dashed", linewidth=2)
plt.xlabel(" ")
plt.ylabel("Revenue")
plt.title("State Level (CA) Foods3 Department Revenue (2015-2016) Forecasting with 5 Major Events")
plt.legend()
plt.grid(True)
plt.show()
# Plot predicted results 2
title = 'State Level (CA) Foods3 Department Revenue (2015-2016) Forecasting with 5 Major Events'
ylabel= 'Revenue'
xlabel= '' # we don't really need a label here
ax = test_y.plot(legend=True, figsize=(12, 6), title=title)
Foods3_predictions.plot(legend=True, ax=ax) # Make sure predictions plot on the same axis
ax.autoscale(axis='x', tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel)
# Set y-axis to start at zero
ax.set_ylim(bottom=0) # This automatically adjusts the upper limit as needed
plt.show()
(4) Check Evaluation Metrics¶
# Compute evaluation metrics
from sklearn.metrics import mean_absolute_error  # not in the imports at the top

mse = mean_squared_error(test_y, Foods3_predictions)
rmse = np.sqrt(mse)
mae = mean_absolute_error(test_y, Foods3_predictions)
# Zero-revenue days (e.g. Christmas closures) inflate this plain MAPE
mape = np.mean(np.abs((test_y - Foods3_predictions) / test_y)) * 100
# Display results
evaluation_results = (
f"Mean Squared Error (MSE): {mse:.2f}\n"
f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
f"Mean Absolute Error (MAE): {mae:.2f}\n"
f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)
print(evaluation_results)
Mean Squared Error (MSE): 6068119.54
Root Mean Squared Error (RMSE): 2463.36
Mean Absolute Error (MAE): 1553.20
Mean Absolute Percentage Error (MAPE): 525.80%
from sklearn.metrics import mean_absolute_percentage_error  # not in the imports at the top

mape_value = mean_absolute_percentage_error(
    test_y,
    Foods3_predictions
) * 100 # sklearn version returns a fraction, so multiply by 100
print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 525.80%
2. Top-Down Approach¶
Recall Revenue Percentage by Store (All Time 2012/1/1-2015/12/31)¶
CA_1: 29%
CA_2: 17%
CA_3: 39%
CA_4: 15%
Since the above are all-time store revenue shares, we instead need each store's share over the same calendar window one year before the forecast period (2014/03/14-2014/12/31)¶
(Note we should not compute store shares from the actual forecast period: in a real deployment that is future data we would not yet have, so using it would be data leakage even though it exists in this dataset)¶
# 1. Define the date range
start_date = '2014-03-14'
end_date = '2014-12-31'
# 2. Filter rows to the specified date range
df_filtered = Foods3_CA_df[
(Foods3_CA_df['date'] >= start_date) &
(Foods3_CA_df['date'] <= end_date)
]
# 3. Group by 'store_id' and sum 'revenue'
store_revenue_sums = df_filtered.groupby('store_id')['revenue'].sum()
# 4. Calculate the total revenue across all stores
total_revenue_period = store_revenue_sums.sum()
# 5. Calculate each store's share of total revenue
store_shares = store_revenue_sums / total_revenue_period
# 6. Display the results
store_shares.round(2)
store_id
CA_1    0.29
CA_2    0.15
CA_3    0.41
CA_4    0.15
Name: revenue, dtype: float64
Revenue Percentage by Store (2014/3/14-2014/12/31)¶
CA_1: 29%
CA_2: 15%
CA_3: 41%
CA_4: 15%
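The four per-store subsections below all follow the same recipe: scale the department forecast by the store's share, pull that store's actuals, and score the result. The workflow can be sketched as one loop; the data objects here are toy stand-ins for the notebook's `Foods3_predictions`, `store_shares`, and per-store actuals, so the sketch is self-contained:

```python
import numpy as np
import pandas as pd

# Toy stand-ins (assumptions) for the notebook's forecast and share objects
idx = pd.date_range("2015-03-14", periods=5, freq="D")
dept_forecast = pd.Series([10000.0, 12000.0, 9000.0, 11000.0, 13000.0], index=idx)
store_shares = pd.Series({"CA_1": 0.29, "CA_2": 0.15, "CA_3": 0.41, "CA_4": 0.15})

# Toy per-store actuals (in the notebook these come from Foods3_CA_df)
rng = np.random.default_rng(42)
actuals = {store: dept_forecast * share * rng.uniform(0.9, 1.1, size=len(idx))
           for store, share in store_shares.items()}

results = {}
for store, share in store_shares.items():
    forecast = dept_forecast * share          # top-down disaggregation
    err = actuals[store] - forecast
    results[store] = {"RMSE": float(np.sqrt((err ** 2).mean())),
                      "MAE": float(err.abs().mean())}

print(pd.DataFrame(results).T.round(2))
```

Factoring the repetition into a loop like this also guarantees every store is scored with exactly the same logic.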
(1) CA_1 Store Top-Down¶
- Calculate CA_1 Predicted Revenue
# Extract CA_1 percentage
store_CA_1 = store_shares.loc['CA_1']
# Get CA_1 store revenue forecast
forecast_CA_1 = Foods3_predictions * store_CA_1
forecast_CA_1
2015-03-14 6127.013528
2015-03-15 6398.647870
2015-03-16 5001.542032
2015-03-17 4537.310784
2015-03-18 4399.026367
...
2015-12-27 6563.185270
2015-12-28 5174.155505
2015-12-29 4715.199396
2015-12-30 4591.944929
2015-12-31 4341.666944
Freq: D, Name: predicted_mean, Length: 293, dtype: float64
- Get CA_1 Actual Revenue
# 1. Filter for store_id == 'CA_1'
df_CA1 = Foods3_CA_df[Foods3_CA_df['store_id'] == 'CA_1'].copy()
# 2. Restrict to desired date range (2015-03-14 to 2015-12-31)
start_date = '2015-03-14'
end_date = '2015-12-31'
mask = (df_CA1['date'] >= start_date) & (df_CA1['date'] <= end_date)
df_CA1_filtered = df_CA1.loc[mask]
# 3. Select only 'date' and 'revenue', then sort by date
actual_CA_1 = df_CA1_filtered[['date', 'revenue']].sort_values('date')
# 4. Set date column as index
actual_CA_1.set_index('date', inplace=True)
# 5. Print time series of actual revenue
actual_CA_1
| date | revenue |
|---|---|
| 2015-03-14 | 6141.6230 |
| 2015-03-15 | 7154.1210 |
| 2015-03-16 | 4833.7110 |
| 2015-03-17 | 3950.7646 |
| 2015-03-18 | 4107.2850 |
| ... | ... |
| 2015-12-27 | 4446.1396 |
| 2015-12-28 | 4131.3706 |
| 2015-12-29 | 4129.5820 |
| 2015-12-30 | 4345.4190 |
| 2015-12-31 | 5333.0360 |
293 rows × 1 columns
- Plot CA_1 Actual VS Forecast
# Plot actual VS forecast
plt.figure(figsize=(10, 6))
plt.plot(actual_CA_1.index, actual_CA_1, label='Actual CA_1 Revenue')
plt.plot(forecast_CA_1.index, forecast_CA_1, label='Forecast CA_1 Revenue')
# Add labels and legend
plt.xlabel('')
plt.ylabel('Revenue')
plt.title('State Level (CA) CA_1 Store Revenue Forecasting (2015/3-2015/12) with 5 Major Events')
plt.legend()
# Show the plot
plt.show()
- Evaluate CA_1 Top-Down Forecast Result
# Compute evaluation metrics
mse = mean_squared_error(actual_CA_1, forecast_CA_1)
rmse = np.sqrt(mse)
mae = mean_absolute_error(actual_CA_1, forecast_CA_1)
# Select the 'revenue' column so the subtraction aligns on the date index;
# DataFrame-minus-Series aligns on columns and silently yields all-NaN,
# which is why the MAPE printed below is nan
mape = np.mean(np.abs((actual_CA_1['revenue'] - forecast_CA_1) / actual_CA_1['revenue'])) * 100
# Display results
evaluation_results = (
f"Mean Squared Error (MSE): {mse:.2f}\n"
f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
f"Mean Absolute Error (MAE): {mae:.2f}\n"
f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)
print(evaluation_results)
Mean Squared Error (MSE): 674801.37
Root Mean Squared Error (RMSE): 821.46
Mean Absolute Error (MAE): 559.64
Mean Absolute Percentage Error (MAPE): nan%
# sklearn clips zero actuals to machine epsilon, so zero-revenue days
# produce the astronomically large value below
mape_value = mean_absolute_percentage_error(
    actual_CA_1,
    forecast_CA_1
) * 100 # sklearn version returns a fraction, so multiply by 100
print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 7846370137560414208.00%
(2) CA_2 Store Top-Down¶
- Calculate CA_2 Predicted Revenue
# Extract CA_2 percentage
store_CA_2 = store_shares.loc['CA_2']
# Get CA_2 store revenue forecast
forecast_CA_2 = Foods3_predictions * store_CA_2
forecast_CA_2
2015-03-14 3196.456026
2015-03-15 3338.167355
2015-03-16 2609.298820
2015-03-17 2367.109903
2015-03-18 2294.967079
...
2015-12-27 3424.006330
2015-12-28 2699.351073
2015-12-29 2459.914191
2015-12-30 2395.612475
2015-12-31 2265.042733
Freq: D, Name: predicted_mean, Length: 293, dtype: float64
- Get CA_2 Actual Revenue
# 1. Filter for store_id == 'CA_2'
df_CA2 = Foods3_CA_df[Foods3_CA_df['store_id'] == 'CA_2'].copy()
# 2. Restrict to desired date range (2015-03-14 to 2015-12-31)
start_date = '2015-03-14'
end_date = '2015-12-31'
mask = (df_CA2['date'] >= start_date) & (df_CA2['date'] <= end_date)
df_CA2_filtered = df_CA2.loc[mask]
# 3. Select only 'date' and 'revenue', then sort by date
actual_CA_2 = df_CA2_filtered[['date', 'revenue']].sort_values('date')
# 4. Set date column as index
actual_CA_2.set_index('date', inplace=True)
# 5. Print time series of actual revenue
actual_CA_2
| date | revenue |
|---|---|
| 2015-03-14 | 3014.8530 |
| 2015-03-15 | 3557.4578 |
| 2015-03-16 | 1831.1394 |
| 2015-03-17 | 1876.0859 |
| 2015-03-18 | 1772.1809 |
| ... | ... |
| 2015-12-27 | 4265.5730 |
| 2015-12-28 | 3385.6113 |
| 2015-12-29 | 3380.6145 |
| 2015-12-30 | 3926.1920 |
| 2015-12-31 | 5141.0130 |
293 rows × 1 columns
- Plot CA_2 Actual VS Forecast
# Plot actual VS forecast
plt.figure(figsize=(10, 6))
plt.plot(actual_CA_2.index, actual_CA_2, label='Actual CA_2 Revenue')
plt.plot(forecast_CA_2.index, forecast_CA_2, label='Forecast CA_2 Revenue')
# Add labels and legend
plt.xlabel('')
plt.ylabel('Revenue')
plt.title('State Level (CA) CA_2 Store Revenue Forecasting (2015/3-2015/12) with 5 Major Events')
plt.legend()
# Show the plot
plt.show()
- Evaluate CA_2 Top-Down Forecast Result
# Compute evaluation metrics
mse = mean_squared_error(actual_CA_2, forecast_CA_2)
rmse = np.sqrt(mse)
mae = mean_absolute_error(actual_CA_2, forecast_CA_2)
mape = np.mean(np.abs((actual_CA_2['revenue'] - forecast_CA_2) / actual_CA_2['revenue'])) * 100  # use the column so index alignment works
# Display results
evaluation_results = (
f"Mean Squared Error (MSE): {mse:.2f}\n"
f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
f"Mean Absolute Error (MAE): {mae:.2f}\n"
f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)
print(evaluation_results)
Mean Squared Error (MSE): 1572627.38
Root Mean Squared Error (RMSE): 1254.04
Mean Absolute Error (MAE): 1021.30
Mean Absolute Percentage Error (MAPE): nan%
mape_value = mean_absolute_percentage_error(
actual_CA_2,
forecast_CA_2
) * 100 # sklearn version returns a decimal, so multiply by 100
print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 281.42%
(3) CA_3 Store Top-Down¶
- Calculate CA_3 Predicted Revenue
# Extract CA_3 percentage
store_CA_3 = store_shares.loc['CA_3']
# Get CA_3 store revenue forecast
forecast_CA_3 = Foods3_predictions * store_CA_3
forecast_CA_3
2015-03-14 8538.410280
2015-03-15 8916.951219
2015-03-16 6969.989165
2015-03-17 6323.051330
2015-03-18 6130.342585
...
2015-12-27 9146.245282
2015-12-28 7210.537785
2015-12-29 6570.951216
2015-12-30 6399.187729
2015-12-31 6050.408327
Freq: D, Name: predicted_mean, Length: 293, dtype: float64
- Get CA_3 Actual Revenue
# 1. Filter for store_id == 'CA_3'
df_CA3 = Foods3_CA_df[Foods3_CA_df['store_id'] == 'CA_3'].copy()
# 2. Restrict to desired date range (2015-03-14 to 2015-12-31)
start_date = '2015-03-14'
end_date = '2015-12-31'
mask = (df_CA3['date'] >= start_date) & (df_CA3['date'] <= end_date)
df_CA3_filtered = df_CA3.loc[mask]
# 3. Select only 'date' and 'revenue', then sort by date
actual_CA_3 = df_CA3_filtered[['date', 'revenue']].sort_values('date')
# 4. Set date column as index
actual_CA_3.set_index('date', inplace=True)
# 5. Print time series of actual revenue
actual_CA_3
| date | revenue |
|---|---|
| 2015-03-14 | 8467.2440 |
| 2015-03-15 | 10116.2220 |
| 2015-03-16 | 7627.6895 |
| 2015-03-17 | 6562.2295 |
| 2015-03-18 | 5713.6250 |
| ... | ... |
| 2015-12-27 | 5824.9710 |
| 2015-12-28 | 5684.8975 |
| 2015-12-29 | 5417.0146 |
| 2015-12-30 | 5948.0480 |
| 2015-12-31 | 6914.9814 |
293 rows × 1 columns
- Plot CA_3 Actual VS Forecast
# Plot actual VS forecast
plt.figure(figsize=(10, 6))
plt.plot(actual_CA_3.index, actual_CA_3, label='Actual CA_3 Revenue')
plt.plot(forecast_CA_3.index, forecast_CA_3, label='Forecast CA_3 Revenue')
# Add labels and legend
plt.xlabel('')
plt.ylabel('Revenue')
plt.title('State Level (CA) CA_3 Store Revenue Forecasting (2015/3-2015/12) with 5 Major Events')
plt.legend()
# Show the plot
plt.show()
- Evaluate CA_3 Top-Down Forecast Result
# Compute evaluation metrics
mse = mean_squared_error(actual_CA_3, forecast_CA_3)
rmse = np.sqrt(mse)
mae = mean_absolute_error(actual_CA_3, forecast_CA_3)
mape = np.mean(np.abs((actual_CA_3['revenue'] - forecast_CA_3) / actual_CA_3['revenue'])) * 100  # use the column so index alignment works
# Display results
evaluation_results = (
f"Mean Squared Error (MSE): {mse:.2f}\n"
f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
f"Mean Absolute Error (MAE): {mae:.2f}\n"
f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)
print(evaluation_results)
Mean Squared Error (MSE): 1469628.02
Root Mean Squared Error (RMSE): 1212.28
Mean Absolute Error (MAE): 819.96
Mean Absolute Percentage Error (MAPE): nan%
mape_value = mean_absolute_percentage_error(
actual_CA_3,
forecast_CA_3
) * 100 # sklearn version returns a decimal, so multiply by 100
print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 317.09%
(4) CA_4 Store Top-Down¶
- Calculate CA_4 Predicted Revenue
# Extract CA_4 percentage
store_CA_4 = store_shares.loc['CA_4']
# Get CA_4 store revenue forecast
forecast_CA_4 = Foods3_predictions * store_CA_4
forecast_CA_4
2015-03-14 3118.633907
2015-03-15 3256.895079
2015-03-16 2545.771851
2015-03-17 2309.479356
2015-03-18 2239.092948
...
2015-12-27 3340.644186
2015-12-28 2633.631658
2015-12-29 2400.024196
2015-12-30 2337.287993
2015-12-31 2209.897152
Freq: D, Name: predicted_mean, Length: 293, dtype: float64
- Get CA_4 Actual Revenue
# 1. Filter for store_id == 'CA_4'
df_CA4 = Foods3_CA_df[Foods3_CA_df['store_id'] == 'CA_4'].copy()
# 2. Restrict to desired date range (2015-03-14 to 2015-12-31)
start_date = '2015-03-14'
end_date = '2015-12-31'
mask = (df_CA4['date'] >= start_date) & (df_CA4['date'] <= end_date)
df_CA4_filtered = df_CA4.loc[mask]
# 3. Select only 'date' and 'revenue', then sort by date
actual_CA_4 = df_CA4_filtered[['date', 'revenue']].sort_values('date')
# 4. Set date column as index
actual_CA_4.set_index('date', inplace=True)
# 5. Print time series of actual revenue
actual_CA_4
| date | revenue |
|---|---|
| 2015-03-14 | 2954.2010 |
| 2015-03-15 | 3434.8767 |
| 2015-03-16 | 2702.6387 |
| 2015-03-17 | 2529.3633 |
| 2015-03-18 | 2360.2935 |
| ... | ... |
| 2015-12-27 | 1891.2354 |
| 2015-12-28 | 2110.1110 |
| 2015-12-29 | 1747.9710 |
| 2015-12-30 | 1981.7717 |
| 2015-12-31 | 2814.8704 |
293 rows × 1 columns
- Plot CA_4 Actual VS Forecast
# Plot actual VS forecast
plt.figure(figsize=(10, 6))
plt.plot(actual_CA_4.index, actual_CA_4, label='Actual CA_4 Revenue')
plt.plot(forecast_CA_4.index, forecast_CA_4, label='Forecast CA_4 Revenue')
# Add labels and legend
plt.xlabel('')
plt.ylabel('Revenue')
plt.title('State Level (CA) CA_4 Store Revenue Forecasting (2015/3-2015/12) with 5 Major Events')
plt.legend()
# Show the plot
plt.show()
- Evaluate CA_4 Top-Down Forecast Result
# Compute evaluation metrics
mse = mean_squared_error(actual_CA_4, forecast_CA_4)
rmse = np.sqrt(mse)
mae = mean_absolute_error(actual_CA_4, forecast_CA_4)
mape = np.mean(np.abs((actual_CA_4['revenue'] - forecast_CA_4) / actual_CA_4['revenue'])) * 100  # use the column so index alignment works
# Display results
evaluation_results = (
f"Mean Squared Error (MSE): {mse:.2f}\n"
f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
f"Mean Absolute Error (MAE): {mae:.2f}\n"
f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)
print(evaluation_results)
Mean Squared Error (MSE): 193354.44
Root Mean Squared Error (RMSE): 439.72
Mean Absolute Error (MAE): 291.09
Mean Absolute Percentage Error (MAPE): nan%
mape_value = mean_absolute_percentage_error(
actual_CA_4,
forecast_CA_4
) * 100 # sklearn version returns a decimal, so multiply by 100
print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 3993781937241965568.00%
3. CA_1, CA_2, CA_3, CA_4 Store Revenue Forecasting¶
# Filter for store_id == 'CA_1'
CA_1_daily = Foods3_CA_df[Foods3_CA_df['store_id'] == 'CA_1'].copy()
CA_1_daily
| | date | store_id | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | week | month | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1348 | 2012-01-01 | CA_1 | 1448 | 1.645839 | 3213.3958 | 1.0 | 0 | 0 | 1 | 0 | 0 | 2011-12-26/2012-01-01 | 2012-01 | 2012 |
| 1352 | 2012-01-02 | CA_1 | 2001 | 1.645839 | 4684.9550 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| 1359 | 2012-01-03 | CA_1 | 1711 | 1.645839 | 3916.0360 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| 1363 | 2012-01-04 | CA_1 | 1598 | 1.645839 | 3606.1812 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| 1364 | 2012-01-05 | CA_1 | 1949 | 1.645839 | 4310.1980 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7172 | 2015-12-27 | CA_1 | 1669 | 2.921556 | 4446.1396 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-21/2015-12-27 | 2015-12 | 2015 |
| 7176 | 2015-12-28 | CA_1 | 1597 | 2.921556 | 4131.3706 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
| 7180 | 2015-12-29 | CA_1 | 1579 | 2.921556 | 4129.5820 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
| 7186 | 2015-12-30 | CA_1 | 1582 | 2.921556 | 4345.4190 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
| 7189 | 2015-12-31 | CA_1 | 1940 | 2.921556 | 5333.0360 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
1461 rows × 14 columns
# Plot daily revenue trend
plt.figure(figsize=(12,6))
plt.plot(CA_1_daily['date'], CA_1_daily['revenue'], label='Daily Revenue', linewidth=1)
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('CA_1 Store Daily Revenue Trend Over Time')
plt.legend()
plt.grid(True)
plt.show()
# Plot histograms for each feature (Ignore first plot - date column)
CA_1_daily.iloc[:, 1:].hist(bins=30, figsize=(12, 10))
plt.tight_layout()
plt.show()
CA_1_daily.describe()
| date | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1461 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 |
| mean | 2013-12-31 00:00:00 | 2160.395619 | 2.418859 | 5031.478264 | 0.328542 | 0.002738 | 0.002738 | 0.002738 | 0.002738 | 0.002738 |
| min | 2012-01-01 00:00:00 | 0.000000 | 1.645839 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2012-12-31 00:00:00 | 1755.000000 | 2.025482 | 4131.370600 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 2013-12-31 00:00:00 | 2067.000000 | 2.425668 | 4798.596700 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 2014-12-31 00:00:00 | 2515.000000 | 2.784499 | 5833.535000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 2015-12-31 00:00:00 | 4042.000000 | 2.921895 | 8927.261000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| std | NaN | 536.165123 | 0.399183 | 1251.038288 | 0.469844 | 0.052271 | 0.052271 | 0.052271 | 0.052271 | 0.052271 |
CA_1_daily.shape
(1461, 14)
# Apply decomposition (Additive model for clearer seasonality)
decomposition = seasonal_decompose(CA_1_daily.set_index('date')['revenue'], model='additive', period=365)
# Plot decomposition results
fig = decomposition.plot()   # decomposition.plot() creates its own figure,
fig.set_size_inches(12, 8)   # so size that figure directly instead of calling plt.figure()
fig.axes[0].set_title("")
plt.suptitle("Seasonal Decomposition of Revenue - CA_1 Store", fontsize=14)
plt.show()
# Extract seasonal component
seasonality_strength = np.var(decomposition.seasonal) / np.var(CA_1_daily['revenue'])
# Display the seasonality strength
seasonality_strength
0.5651099818207891
Roughly 57% of the variance in the CA_1 store revenue series is explained by seasonality, indicating strong seasonality¶
# Autocorrelation and Partial Autocorrelation plots
plt.figure(figsize=(12,4))
plot_acf(CA_1_daily['revenue'], lags=50, zero=False, ax=plt.gca()) # Autocorrelation from lag 1
plt.title('Autocorrelation of Revenue (starting from lag 1)')
plt.show()
plt.figure(figsize=(12,4))
plot_pacf(CA_1_daily['revenue'], lags=50, zero=False, ax=plt.gca()) # Partial Autocorrelation from lag 1
plt.title('Partial Autocorrelation of Revenue (starting from lag 1)')
plt.show()
The PACF indicates the AR order p: it cuts off clearly at lag 1, so p = 1.
The ACF indicates the MA order q: the cutoff is not obvious, so we rely on auto ARIMA to determine q.
Begin Modeling¶
- Apply ADF test to check stationarity
# Automate Augmented Dickey-Fuller (ADF) test for stationarity
adf_test(CA_1_daily['revenue'])
Augmented Dickey-Fuller Test:
ADF test statistic        -3.828987
p-value                    0.002624
# lags used               24.000000
# observations          1436.000000
critical value (1%)       -3.434912
critical value (5%)       -2.863555
critical value (10%)      -2.567843
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
Since the series is stationary, no differencing is needed to achieve stationarity; the differencing steps below are skipped (left commented out)
# # Apply first-order differencing
# CA_1_daily['revenue_diff'] = CA_1_daily['revenue'].diff()
# # Perform ADF test again on the differenced series
# adf_test(CA_1_daily['revenue_diff'])
# # Plot the differenced revenue
# plt.figure(figsize=(12,6))
# plt.plot(CA_1_daily['date'], CA_1_daily['revenue_diff'], label='Differenced Revenue', color='purple')
# plt.xlabel('')
# plt.ylabel('Differenced Revenue')
# plt.title('First-Order Differencing of Revenue to Achieve Stationarity')
# plt.legend()
# plt.grid(True)
# plt.show()
- Use auto ARIMA to determine the optimal parameters (p, d, q). Since the PACF indicates p = 1, we fix p = 1 in auto ARIMA to reduce computation time
# Define the endogenous (target) variable and exogenous variables
y = CA_1_daily.set_index('date')['revenue']
exo = CA_1_daily.set_index('date')[['event1_Christmas', 'event1_Thanksgiving', 'event1_NewYear', 'event1_SuperBowl', 'event1_LaborDay']]
# Use Auto ARIMA to determine optimal SARIMAX parameters
auto_arima_model = auto_arima(  # imported above as `from pmdarima import auto_arima`
y,
exogenous=exo,
stationary=True, # Tells auto_arima that the series is already stationary
seasonal=True,
m=7, # Assuming weekly seasonality
start_p=1, # Minimum AR order
max_p=1, # Maximum AR order (forces p = 1)
trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True
)
# Extract the best parameters
optimal_order = auto_arima_model.order
optimal_seasonal_order = auto_arima_model.seasonal_order
optimal_order, optimal_seasonal_order
Performing stepwise search to minimize aic
 ARIMA(1,0,2)(1,0,1)[7] intercept   : AIC=23614.447, Time=3.76 sec
 ARIMA(0,0,0)(0,0,0)[7] intercept   : AIC=24988.051, Time=0.03 sec
 ARIMA(1,0,0)(1,0,0)[7] intercept   : AIC=23557.927, Time=1.56 sec
 ARIMA(0,0,1)(0,0,1)[7] intercept   : AIC=23932.189, Time=0.67 sec
 ARIMA(0,0,0)(0,0,0)[7]             : AIC=29141.298, Time=0.02 sec
 ARIMA(1,0,0)(0,0,0)[7] intercept   : AIC=24466.946, Time=0.07 sec
 ARIMA(1,0,0)(2,0,0)[7] intercept   : AIC=23701.182, Time=4.73 sec
 ARIMA(1,0,0)(1,0,1)[7] intercept   : AIC=23706.897, Time=1.61 sec
 ARIMA(1,0,0)(0,0,1)[7] intercept   : AIC=23947.127, Time=0.66 sec
 ARIMA(1,0,0)(2,0,1)[7] intercept   : AIC=23733.032, Time=3.63 sec
 ARIMA(0,0,0)(1,0,0)[7] intercept   : AIC=23667.791, Time=1.20 sec
 ARIMA(1,0,1)(1,0,0)[7] intercept   : AIC=inf, Time=1.43 sec
 ARIMA(0,0,1)(1,0,0)[7] intercept   : AIC=23452.630, Time=1.45 sec
 ARIMA(0,0,1)(0,0,0)[7] intercept   : AIC=24363.978, Time=0.13 sec
 ARIMA(0,0,1)(2,0,0)[7] intercept   : AIC=24289.360, Time=2.22 sec
 ARIMA(0,0,1)(1,0,1)[7] intercept   : AIC=23912.490, Time=1.67 sec
 ARIMA(0,0,1)(2,0,1)[7] intercept   : AIC=23837.642, Time=3.70 sec
 ARIMA(0,0,2)(1,0,0)[7] intercept   : AIC=24139.575, Time=1.44 sec
 ARIMA(1,0,2)(1,0,0)[7] intercept   : AIC=23392.921, Time=1.64 sec
 ARIMA(1,0,2)(0,0,0)[7] intercept   : AIC=24310.794, Time=0.28 sec
 ARIMA(1,0,2)(2,0,0)[7] intercept   : AIC=inf, Time=3.60 sec
 ARIMA(1,0,2)(0,0,1)[7] intercept   : AIC=23902.845, Time=1.31 sec
 ARIMA(1,0,2)(2,0,1)[7] intercept   : AIC=23590.297, Time=4.73 sec
 ARIMA(1,0,3)(1,0,0)[7] intercept   : AIC=23480.472, Time=2.31 sec
 ARIMA(0,0,3)(1,0,0)[7] intercept   : AIC=24344.688, Time=2.06 sec
 ARIMA(1,0,2)(1,0,0)[7]             : AIC=inf, Time=1.50 sec

Best model:  ARIMA(1,0,2)(1,0,0)[7] intercept
Total fit time: 47.429 seconds
((1, 0, 2), (1, 0, 0, 7))
# Applying AUTO_ARIMA model to get p and q automatically
auto_arima_model.summary()
| Dep. Variable: | y | No. Observations: | 1461 |
|---|---|---|---|
| Model: | SARIMAX(1, 0, 2)x(1, 0, [], 7) | Log Likelihood | -11690.461 |
| Date: | Wed, 12 Mar 2025 | AIC | 23392.921 |
| Time: | 19:59:10 | BIC | 23424.643 |
| Sample: | 01-01-2012 | HQIC | 23404.754 |
| - 12-31-2015 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 2791.5405 | 182.763 | 15.274 | 0.000 | 2433.331 | 3149.750 |
| ar.L1 | -0.9064 | 0.027 | -33.836 | 0.000 | -0.959 | -0.854 |
| ma.L1 | 1.3494 | 0.030 | 45.041 | 0.000 | 1.291 | 1.408 |
| ma.L2 | 0.4637 | 0.019 | 23.914 | 0.000 | 0.426 | 0.502 |
| ar.S.L7 | 0.7090 | 0.018 | 39.561 | 0.000 | 0.674 | 0.744 |
| sigma2 | 5.203e+05 | 1.16e+04 | 44.873 | 0.000 | 4.98e+05 | 5.43e+05 |
| Ljung-Box (L1) (Q): | 0.15 | Jarque-Bera (JB): | 862.43 |
|---|---|---|---|
| Prob(Q): | 0.70 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 1.10 | Skew: | -0.16 |
| Prob(H) (two-sided): | 0.30 | Kurtosis: | 6.75 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
- Apply auto ARIMA results to fit SARIMAX model
# Fit the SARIMAX model
sarimax_model = SARIMAX(y,
exog=exo,
order=optimal_order,
seasonal_order=optimal_seasonal_order,
enforce_stationarity=False,
enforce_invertibility=False)
sarimax_results = sarimax_model.fit()
# Display model summary
sarimax_results.summary()
| Dep. Variable: | revenue | No. Observations: | 1461 |
|---|---|---|---|
| Model: | SARIMAX(1, 0, 2)x(1, 0, [], 7) | Log Likelihood | -12127.716 |
| Date: | Wed, 12 Mar 2025 | AIC | 24275.431 |
| Time: | 19:59:16 | BIC | 24328.245 |
| Sample: | 01-01-2012 | HQIC | 24295.137 |
| - 12-31-2015 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| event1_Christmas | -7.325e-05 | 185.983 | -3.94e-07 | 1.000 | -364.520 | 364.519 |
| event1_Thanksgiving | 3108.4775 | 182.275 | 17.054 | 0.000 | 2751.226 | 3465.729 |
| event1_NewYear | 3386.6197 | 362.733 | 9.336 | 0.000 | 2675.676 | 4097.563 |
| event1_SuperBowl | 6335.3692 | 157.486 | 40.228 | 0.000 | 6026.701 | 6644.037 |
| event1_LaborDay | 6846.0455 | 168.361 | 40.663 | 0.000 | 6516.065 | 7176.026 |
| ar.L1 | 1.0000 | 0.000 | 2806.764 | 0.000 | 0.999 | 1.001 |
| ma.L1 | -0.7424 | 0.041 | -18.029 | 0.000 | -0.823 | -0.662 |
| ma.L2 | -0.2460 | 0.041 | -5.989 | 0.000 | -0.326 | -0.165 |
| ar.S.L7 | 0.6160 | 0.026 | 23.809 | 0.000 | 0.565 | 0.667 |
| sigma2 | 1.615e+06 | 0.772 | 2.09e+06 | 0.000 | 1.61e+06 | 1.61e+06 |
| Ljung-Box (L1) (Q): | 0.54 | Jarque-Bera (JB): | 5580.17 |
|---|---|---|---|
| Prob(Q): | 0.46 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 0.92 | Skew: | -0.81 |
| Prob(H) (two-sided): | 0.35 | Kurtosis: | 12.46 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 7.27e+20. Standard errors may be unstable.
# Plot diagnostic plots (residuals, etc.)
sarimax_results.plot_diagnostics(figsize=(15, 12))
plt.show()
- Plot actual and predicted trend
plt.figure(figsize=(12, 6))
plt.plot(y.index, y, label='Actual', marker='o')
plt.plot(y.index, sarimax_results.fittedvalues, label='Fitted', marker='x', color='red')
plt.title('Actual vs Fitted Values')
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.legend()
plt.show()
- Evaluate predicted result using MAPE
# Actual values (endogenous variable)
y_true = y # or test set if we're evaluating out-of-sample
# In-sample fitted values from SARIMAX
y_pred = sarimax_results.fittedvalues
# Calculate MAPE
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f'MAPE: {mape:.2f}%')
MAPE: inf%
mape_value = mean_absolute_percentage_error(
y_true,
y_pred
) * 100 # sklearn version returns a decimal, so multiply by 100
print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 5529745410436780032.00%
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
# Display results
evaluation_results = (
f"Mean Squared Error (MSE): {mse:.2f}\n"
f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
f"Mean Absolute Error (MAE): {mae:.2f}\n"
)
print(evaluation_results)
Mean Squared Error (MSE): 925908.46
Root Mean Squared Error (RMSE): 962.24
Mean Absolute Error (MAE): 641.76
Model Evaluation¶
- Train-Test Validation
# Split data into train and test sets (80-20 split)
train_size = int(len(y) * 0.8)
train_y, test_y = y.iloc[:train_size], y.iloc[train_size:]
train_exo, test_exo = exo.iloc[:train_size, :], exo.iloc[train_size:, :]
# Refit SARIMAX model on training data
sarimax_model_train = SARIMAX(train_y,
exog=train_exo,
order=optimal_order,
seasonal_order=optimal_seasonal_order,
enforce_stationarity=False,
enforce_invertibility=False)
sarimax_results_train = sarimax_model_train.fit()
# Forecast on test set
CA_1_predictions = sarimax_results_train.forecast(steps=len(test_y), exog=test_exo)
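A single 80/20 split evaluates only one forecast origin. A rolling-origin (walk-forward) scheme yields a distribution of out-of-sample errors instead. The sketch below uses our own names (`walk_forward_mape`, `seasonal_naive`); the seasonal-naive forecaster is a cheap stand-in for refitting SARIMAX at each fold:

```python
import numpy as np

def walk_forward_mape(series, horizon, n_folds, forecaster):
    """Rolling-origin evaluation: each fold trains on a longer prefix of the
    series and forecasts the next `horizon` points."""
    series = np.asarray(series, dtype=float)
    errors = []
    for k in range(n_folds):
        split = len(series) - (n_folds - k) * horizon
        train = series[:split]
        test = series[split:split + horizon]
        pred = forecaster(train, horizon)
        errors.append(np.mean(np.abs((test - pred) / test)) * 100)
    return errors

def seasonal_naive(train, horizon, m=7):
    """Forecast by repeating the last full seasonal cycle (weekly here)."""
    last_cycle = train[-m:]
    reps = int(np.ceil(horizon / m))
    return np.tile(last_cycle, reps)[:horizon]
```

In the notebook's setting, `forecaster` would refit SARIMAX on `train` and call `.forecast(horizon)`; the point of the structure is that every fold's test window lies strictly after its training window.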
- Check Distribution of MAPE
# Extract the actual and forecast values from the DataFrame
actual_values = test_y.values
forecast_values = CA_1_predictions
# Ensure both are one-dimensional arrays
actual_values = np.squeeze(actual_values)
forecast_values = np.squeeze(forecast_values)
# Calculate the MAPE for each row
mape_row_by_row = np.abs((actual_values - forecast_values) / actual_values) * 100
# Create a DataFrame to display the row-by-row MAPE along with the actual and forecast values
df_results = pd.DataFrame({
'Actual': actual_values,
'Forecast': forecast_values,
'MAPE': mape_row_by_row
})
# Calculate the overall weighted MAPE
weights = actual_values / actual_values.sum()
overall_mape = np.sum(mape_row_by_row * weights)
# Display the DataFrame
print(df_results)
# Print the overall MAPE
print(f"Overall MAPE: {overall_mape:.2f}%")
                Actual     Forecast       MAPE
2015-03-14   6141.6230  6154.238994   0.205418
2015-03-15   7154.1210  6606.418043   7.655769
2015-03-16   4833.7110  5115.085114   5.821079
2015-03-17   3950.7646  4712.327626  19.276345
2015-03-18   4107.2850  4622.632162  12.547149
...                ...          ...        ...
2015-12-27   4446.1396  5172.092356  16.327709
2015-12-28   4131.3706  5172.186530  25.192994
2015-12-29   4129.5820  5172.280705  25.249498
2015-12-30   4345.4190  5172.374883  19.030521
2015-12-31   5333.0360  5172.469062   3.010798

[293 rows x 3 columns]
Overall MAPE: 18.14%
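The "overall weighted MAPE" computed above is algebraically just WMAPE: weighting each row's percentage error by its share of total revenue cancels the per-row denominator, leaving total absolute error over total actuals. A quick check using the first three actual/forecast pairs from the printed output:

```python
import numpy as np

actual = np.array([6141.6230, 7154.1210, 4833.7110])
forecast = np.array([6154.238994, 6606.418043, 5115.085114])

# the notebook's weighted row-by-row MAPE ...
row_mape = np.abs((actual - forecast) / actual) * 100
weights = actual / actual.sum()
weighted = np.sum(row_mape * weights)

# ... equals WMAPE: sum of |errors| over sum of actuals
wmape = np.abs(actual - forecast).sum() / actual.sum() * 100
```

Because the weights are revenue shares, high-revenue days dominate the score, which is usually what a store-level revenue forecast should be judged on.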
- Check over-/under-forecasting and whether the predicted pattern matches the actual
# Plot predicted results
plt.figure(figsize=(12, 6))
plt.plot(test_y.index, test_y, label="Actual Revenue", color='blue', linewidth=2)
plt.plot(test_y.index, CA_1_predictions, label="Predicted Revenue", color='red', linestyle="dashed", linewidth=2)
plt.xlabel("")
plt.ylabel("Revenue")
plt.title("CA_1 Store Revenue Forecasting (2015/3-2015/12) with 5 Major Events")
plt.legend()
plt.grid(True)
plt.show()
- Check Evaluation Metrics
# Compute evaluation metrics
mse = mean_squared_error(test_y, CA_1_predictions)
rmse = np.sqrt(mean_squared_error(test_y, CA_1_predictions))
mae = mean_absolute_error(test_y, CA_1_predictions)
mape = np.mean(np.abs((test_y - CA_1_predictions) / test_y)) * 100
# Display results
evaluation_results = (
f"Mean Squared Error (MSE): {mse:.2f}\n"
f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
f"Mean Absolute Error (MAE): {mae:.2f}\n"
f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)
print(evaluation_results)
Mean Squared Error (MSE): 1434183.47
Root Mean Squared Error (RMSE): 1197.57
Mean Absolute Error (MAE): 943.87
Mean Absolute Percentage Error (MAPE): inf%
mape_value = mean_absolute_percentage_error(
test_y,
CA_1_predictions
) * 100 # sklearn version returns a decimal, so multiply by 100
print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 7949551095629456384.00%
Comparison of CA_1 Forecast Plots with Evaluation Metrics¶
| CA_1 Top_Down Forecast | CA_1 Direct Forecast |
|---|---|
| MSE: 674801, RMSE: 821, MAE: 560, MAPE: infinite | MSE: 1434183, RMSE: 1198, MAE: 944, MAPE: infinite |
Conclusion: The top-down approach is recommended for CA_1 store revenue forecasting
# Filter for store_id == 'CA_2'
CA_2_daily = Foods3_CA_df[Foods3_CA_df['store_id'] == 'CA_2'].copy()
CA_2_daily
| date | store_id | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | week | month | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1351 | 2012-01-01 | CA_2 | 941 | 1.473136 | 1994.1145 | 1.0 | 0 | 0 | 1 | 0 | 0 | 2011-12-26/2012-01-01 | 2012-01 | 2012 |
| 1353 | 2012-01-02 | CA_2 | 1364 | 1.473136 | 2982.6143 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| 1358 | 2012-01-03 | CA_2 | 939 | 1.473136 | 1945.9517 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| 1362 | 2012-01-04 | CA_2 | 853 | 1.473136 | 1775.3049 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| 1365 | 2012-01-05 | CA_2 | 971 | 1.473136 | 2079.2812 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7173 | 2015-12-27 | CA_2 | 1627 | 2.923421 | 4265.5730 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-21/2015-12-27 | 2015-12 | 2015 |
| 7177 | 2015-12-28 | CA_2 | 1369 | 2.923421 | 3385.6113 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
| 7181 | 2015-12-29 | CA_2 | 1406 | 2.923421 | 3380.6145 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
| 7187 | 2015-12-30 | CA_2 | 1625 | 2.923421 | 3926.1920 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
| 7190 | 2015-12-31 | CA_2 | 2087 | 2.923421 | 5141.0130 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
1461 rows × 14 columns
# Plot daily revenue trend
plt.figure(figsize=(12,6))
plt.plot(CA_2_daily['date'], CA_2_daily['revenue'], label='Daily Revenue', linewidth=1)
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('CA_2 Store Daily Revenue Trend Over Time')
plt.legend()
plt.grid(True)
plt.show()
# Plot histograms for each feature (Skip first plot - date column)
CA_2_daily.iloc[:, 1:].hist(bins=30, figsize=(12, 10))
plt.tight_layout()
plt.show()
CA_2_daily.describe()
| date | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1461 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 |
| mean | 2013-12-31 00:00:00 | 1277.983573 | 2.278975 | 2887.853205 | 0.328542 | 0.002738 | 0.002738 | 0.002738 | 0.002738 | 0.002738 |
| min | 2012-01-01 00:00:00 | 2.000000 | 1.473136 | 3.160156 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2012-12-31 00:00:00 | 1018.000000 | 1.817200 | 2294.994400 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 2013-12-31 00:00:00 | 1204.000000 | 2.281236 | 2688.572000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 2014-12-31 00:00:00 | 1494.000000 | 2.747719 | 3392.054700 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 2015-12-31 00:00:00 | 2867.000000 | 2.924174 | 6952.808000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| std | NaN | 364.599095 | 0.469592 | 836.461332 | 0.469844 | 0.052271 | 0.052271 | 0.052271 | 0.052271 | 0.052271 |
CA_2_daily.shape
(1461, 14)
# Apply decomposition (Additive model for clearer seasonality)
decomposition = seasonal_decompose(CA_2_daily.set_index('date')['revenue'], model='additive', period=365)
# Plot decomposition results
fig = decomposition.plot()   # decomposition.plot() creates its own figure,
fig.set_size_inches(12, 8)   # so size that figure directly instead of calling plt.figure()
fig.axes[0].set_title("")
plt.suptitle("Seasonal Decomposition of Revenue - CA_2 Store", fontsize=14)
plt.show()
# Extract seasonal component
seasonality_strength = np.var(decomposition.seasonal) / np.var(CA_2_daily['revenue'])
# Display the seasonality strength
seasonality_strength
0.300318352234158
Roughly 30% of the variance in the CA_2 store revenue series is explained by seasonality, indicating moderate seasonality¶
# Autocorrelation and Partial Autocorrelation plots
plt.figure(figsize=(12,4))
plot_acf(CA_2_daily['revenue'], lags=50, zero=False, ax=plt.gca()) # Autocorrelation from lag 1
plt.title('Autocorrelation of Revenue (starting from lag 1)')
plt.show()
plt.figure(figsize=(12,4))
plot_pacf(CA_2_daily['revenue'], lags=50, zero=False, ax=plt.gca()) # Partial Autocorrelation from lag 1
plt.title('Partial Autocorrelation of Revenue (starting from lag 1)')
plt.show()
The PACF indicates the AR order p: it cuts off clearly at lag 1, so p = 1.
The ACF indicates the MA order q: the cutoff is not obvious, so we rely on auto ARIMA to determine q.
Begin Modeling¶
- Apply ADF test to check stationarity
# Automate Augmented Dickey-Fuller (ADF) test for stationarity
adf_test(CA_2_daily['revenue'])
Augmented Dickey-Fuller Test:
ADF test statistic        -1.379162
p-value                    0.592225
# lags used               22.000000
# observations          1438.000000
critical value (1%)       -3.434906
critical value (5%)       -2.863552
critical value (10%)      -2.567841
Weak evidence against the null hypothesis
Fail to reject the null hypothesis
Data has a unit root and is non-stationary
Since the series is non-stationary, we difference it to achieve stationarity
# Apply first-order differencing
CA_2_daily['revenue_diff'] = CA_2_daily['revenue'].diff()
# Perform ADF test again on the differenced series
adf_test(CA_2_daily['revenue_diff'])
Augmented Dickey-Fuller Test:
ADF test statistic     -1.122857e+01
p-value                 1.932892e-20
# lags used             2.100000e+01
# observations          1.438000e+03
critical value (1%)    -3.434906e+00
critical value (5%)    -2.863552e+00
critical value (10%)   -2.567841e+00
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
Since first-order differencing makes data stationary, d = 1
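The effect of first-order differencing is easy to see on a toy series: a deterministic linear trend is non-stationary (its mean drifts over time), but its first difference is a constant. A minimal illustration:

```python
import pandas as pd

# non-stationary: the mean grows with time
trend = pd.Series([2.0 * t for t in range(10)])

# first-order differencing removes the trend entirely;
# every difference equals the slope, i.e. a constant (stationary) series
diffed = trend.diff().dropna()
```

The real revenue series also has noise and seasonality on top of its trend, so differencing only removes the trend component; that is why d = 1 suffices here while the seasonal structure is left to the seasonal ARIMA terms.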
# Plot the differenced revenue
plt.figure(figsize=(12,6))
plt.plot(CA_2_daily['date'], CA_2_daily['revenue_diff'], label='Differenced Revenue', color='purple')
plt.xlabel('')
plt.ylabel('Differenced Revenue')
plt.title('First-Order Differencing of Revenue to Achieve Stationarity')
plt.legend()
plt.grid(True)
plt.show()
- Use auto ARIMA to determine the optimal parameters (p, d, q). Since the PACF indicates p = 1, we fix p = 1 in auto ARIMA to reduce computation time
# Define the endogenous (target) variable and exogenous variables
y = CA_2_daily.set_index('date')['revenue']
exo = CA_2_daily.set_index('date')[['event1_Christmas', 'event1_Thanksgiving', 'event1_NewYear', 'event1_SuperBowl', 'event1_LaborDay']]
# Use Auto ARIMA to determine optimal SARIMAX parameters
auto_arima_model = auto_arima(  # imported above as `from pmdarima import auto_arima`
y,
exogenous=exo,
seasonal=True,
m=7, # Assuming weekly seasonality
start_p=1, # Minimum AR order
max_p=1, # Maximum AR order (forces p = 1)
trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True
)
# Extract the best parameters
optimal_order = auto_arima_model.order
optimal_seasonal_order = auto_arima_model.seasonal_order
optimal_order, optimal_seasonal_order
Performing stepwise search to minimize aic
 ARIMA(1,1,2)(1,0,1)[7] intercept   : AIC=21328.553, Time=3.30 sec
 ARIMA(0,1,0)(0,0,0)[7] intercept   : AIC=23384.032, Time=0.04 sec
 ARIMA(1,1,0)(1,0,0)[7] intercept   : AIC=22187.029, Time=1.15 sec
 ARIMA(0,1,1)(0,0,1)[7] intercept   : AIC=22612.359, Time=1.24 sec
 ARIMA(0,1,0)(0,0,0)[7]             : AIC=23382.045, Time=0.02 sec
 ARIMA(1,1,2)(0,0,1)[7] intercept   : AIC=22380.623, Time=2.23 sec
 ARIMA(1,1,2)(1,0,0)[7] intercept   : AIC=inf, Time=1.98 sec
 ARIMA(1,1,2)(2,0,1)[7] intercept   : AIC=inf, Time=4.02 sec
 ARIMA(1,1,2)(1,0,2)[7] intercept   : AIC=21282.002, Time=4.56 sec
 ARIMA(1,1,2)(0,0,2)[7] intercept   : AIC=22056.336, Time=2.36 sec
 ARIMA(1,1,2)(2,0,2)[7] intercept   : AIC=21329.344, Time=5.22 sec
 ARIMA(0,1,2)(1,0,2)[7] intercept   : AIC=inf, Time=4.44 sec
 ARIMA(1,1,1)(1,0,2)[7] intercept   : AIC=inf, Time=4.80 sec
 ARIMA(1,1,3)(1,0,2)[7] intercept   : AIC=inf, Time=6.27 sec
 ARIMA(0,1,1)(1,0,2)[7] intercept   : AIC=inf, Time=3.42 sec
 ARIMA(0,1,3)(1,0,2)[7] intercept   : AIC=21794.315, Time=6.28 sec
 ARIMA(1,1,2)(1,0,2)[7]             : AIC=21295.310, Time=2.90 sec

Best model:  ARIMA(1,1,2)(1,0,2)[7] intercept
Total fit time: 54.256 seconds
((1, 1, 2), (1, 0, 2, 7))
# Applying AUTO_ARIMA model to get p and q automatically
auto_arima_model.summary()
| Dep. Variable: | y | No. Observations: | 1461 |
|---|---|---|---|
| Model: | SARIMAX(1, 1, 2)x(1, 0, 2, 7) | Log Likelihood | -10633.001 |
| Date: | Wed, 12 Mar 2025 | AIC | 21282.002 |
| Time: | 22:18:04 | BIC | 21324.292 |
| Sample: | 01-01-2012 | HQIC | 21297.778 |
| - 12-31-2015 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | -0.0028 | 0.024 | -0.117 | 0.907 | -0.049 | 0.044 |
| ar.L1 | 0.3974 | 0.112 | 3.558 | 0.000 | 0.178 | 0.616 |
| ma.L1 | -1.1292 | 0.122 | -9.292 | 0.000 | -1.367 | -0.891 |
| ma.L2 | 0.1766 | 0.110 | 1.598 | 0.110 | -0.040 | 0.393 |
| ar.S.L7 | 0.9989 | 0.001 | 1219.709 | 0.000 | 0.997 | 1.000 |
| ma.S.L7 | -0.8375 | 0.028 | -29.874 | 0.000 | -0.892 | -0.783 |
| ma.S.L14 | 0.0244 | 0.026 | 0.936 | 0.349 | -0.027 | 0.076 |
| sigma2 | 1.231e+05 | 2094.390 | 58.775 | 0.000 | 1.19e+05 | 1.27e+05 |
| Ljung-Box (L1) (Q): | 0.00 | Jarque-Bera (JB): | 19295.16 |
|---|---|---|---|
| Prob(Q): | 0.95 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 2.23 | Skew: | -1.39 |
| Prob(H) (two-sided): | 0.00 | Kurtosis: | 20.59 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
- Apply auto ARIMA results to fit SARIMAX model
# Fit the SARIMAX model
sarimax_model = SARIMAX(y,
exog=exo,
order=optimal_order,
seasonal_order=optimal_seasonal_order,
enforce_stationarity=False,
enforce_invertibility=False)
sarimax_results = sarimax_model.fit()
# Display model summary
sarimax_results.summary()
| Dep. Variable: | revenue | No. Observations: | 1461 |
|---|---|---|---|
| Model: | SARIMAX(1, 1, 2)x(1, 0, 2, 7) | Log Likelihood | -10298.829 |
| Date: | Wed, 12 Mar 2025 | AIC | 20621.658 |
| Time: | 22:18:15 | BIC | 20684.951 |
| Sample: | 01-01-2012 | HQIC | 20645.283 |
| - 12-31-2015 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| event1_Christmas | -2800.2065 | 61.289 | -45.689 | 0.000 | -2920.330 | -2680.083 |
| event1_Thanksgiving | -1268.2217 | 118.327 | -10.718 | 0.000 | -1500.138 | -1036.305 |
| event1_NewYear | -655.0949 | 313.174 | -2.092 | 0.036 | -1268.906 | -41.284 |
| event1_SuperBowl | 250.1420 | 79.756 | 3.136 | 0.002 | 93.824 | 406.460 |
| event1_LaborDay | 706.7567 | 91.645 | 7.712 | 0.000 | 527.136 | 886.377 |
| ar.L1 | 0.3004 | 0.075 | 4.008 | 0.000 | 0.153 | 0.447 |
| ma.L1 | -0.9097 | 0.081 | -11.264 | 0.000 | -1.068 | -0.751 |
| ma.L2 | -0.0314 | 0.071 | -0.440 | 0.660 | -0.171 | 0.109 |
| ar.S.L7 | 0.9932 | 0.003 | 286.705 | 0.000 | 0.986 | 1.000 |
| ma.S.L7 | -0.8231 | 0.027 | -30.807 | 0.000 | -0.875 | -0.771 |
| ma.S.L14 | 0.0538 | 0.025 | 2.178 | 0.029 | 0.005 | 0.102 |
| sigma2 | 9.2e+04 | 2531.910 | 36.335 | 0.000 | 8.7e+04 | 9.7e+04 |
| Ljung-Box (L1) (Q): | 0.00 | Jarque-Bera (JB): | 316.66 |
|---|---|---|---|
| Prob(Q): | 0.98 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 2.09 | Skew: | 0.13 |
| Prob(H) (two-sided): | 0.00 | Kurtosis: | 5.28 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
# Plot diagnostic plots (residuals, etc.)
sarimax_results.plot_diagnostics(figsize=(15, 12))
plt.show()
- Plot actual and predicted trend
plt.figure(figsize=(12, 6))
plt.plot(y.index, y, label='Actual', marker='o')
plt.plot(y.index, sarimax_results.fittedvalues, label='Fitted', marker='x', color='red')
plt.title('Actual vs Fitted Values')
plt.xlabel('')
plt.ylabel('Revenue')
plt.legend()
plt.show()
- Evaluate predicted result using MAPE
# Actual values (endogenous variable)
y_true = y # or test set if we're evaluating out-of-sample
# In-sample fitted values from SARIMAX
y_pred = sarimax_results.fittedvalues
# Calculate MAPE
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f'MAPE: {mape:.2f}%')
MAPE: 56.82%
mape_value = mean_absolute_percentage_error(
y_true,
y_pred
) * 100 # sklearn version returns a decimal, so multiply by 100
print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 56.82%
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
# Display results
evaluation_results = (
f"Mean Squared Error (MSE): {mse:.2f}\n"
f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
f"Mean Absolute Error (MAE): {mae:.2f}\n"
)
print(evaluation_results)
Mean Squared Error (MSE): 101133.29
Root Mean Squared Error (RMSE): 318.01
Mean Absolute Error (MAE): 231.67
Model Evaluation¶
- Train-Test Validation
# Split data into train and test sets (80-20 split)
train_size = int(len(y) * 0.8)
train_y, test_y = y.iloc[:train_size], y.iloc[train_size:]
train_exo, test_exo = exo.iloc[:train_size, :], exo.iloc[train_size:, :]
# Refit SARIMAX model on training data
sarimax_model_train = SARIMAX(train_y,
exog=train_exo,
order=optimal_order,
seasonal_order=optimal_seasonal_order,
enforce_stationarity=False,
enforce_invertibility=False)
sarimax_results_train = sarimax_model_train.fit()
# Forecast on test set
CA_2_predictions = sarimax_results_train.forecast(steps=len(test_y), exog=test_exo)
Since the forecast accuracy is very poor, we manually tuned the SARIMAX parameters below but still saw no improvement. Therefore, we proceed with the auto ARIMA result above¶
# # Split data into train and test sets (80-20 split)
# train_size = int(len(y) * 0.8)
# train_y, test_y = y.iloc[:train_size], y.iloc[train_size:]
# train_exo, test_exo = exo.iloc[:train_size, :], exo.iloc[train_size:, :]
# # Refit SARIMAX model on training data
# sarimax_model_train = SARIMAX(train_y,
# exog=train_exo,
# order=(5, 1, 5),
# seasonal_order=(2, 0, 1, 7),
# enforce_stationarity=False,
# enforce_invertibility=False)
# sarimax_results_train = sarimax_model_train.fit()
# # Forecast on test set
# CA_2_predictions = sarimax_results_train.forecast(steps=len(test_y), exog=test_exo)
- Check Distribution of MAPE
# Extract the actual and forecast values from the DataFrame
actual_values = test_y.values
forecast_values = CA_2_predictions
# Ensure both are one-dimensional arrays
actual_values = np.squeeze(actual_values)
forecast_values = np.squeeze(forecast_values)
# Calculate the MAPE for each row
mape_row_by_row = np.abs((actual_values - forecast_values) / actual_values) * 100
# Create a DataFrame to display the row-by-row MAPE along with the actual and forecast values
df_results = pd.DataFrame({
'Actual': actual_values,
'Forecast': forecast_values,
'MAPE': mape_row_by_row
})
# Calculate the overall weighted MAPE
weights = actual_values / actual_values.sum()
overall_mape = np.sum(mape_row_by_row * weights)
# Display the DataFrame
print(df_results)
# Print the overall MAPE
print(f"Overall MAPE: {overall_mape:.2f}%")
                Actual     Forecast       MAPE
2015-03-14   3014.8530  2932.156404   2.742973
2015-03-15   3557.4578  2979.284514  16.252429
2015-03-16   1831.1394  1870.941040   2.173600
2015-03-17   1876.0859  1762.571225   6.050612
2015-03-18   1772.1809  1739.843169   1.824742
...                ...          ...        ...
2015-12-27   4265.5730  2089.978382  51.003573
2015-12-28   3385.6113  1035.757091  69.407088
2015-12-29   3380.6145   931.965251  72.432076
2015-12-30   3926.1920   913.284018  76.738682
2015-12-31   5141.0130   985.625108  80.828193

[293 rows x 3 columns]
Overall MAPE: 51.03%
- Check over-/under-forecasting and whether the predicted pattern matches the actual
# Plot predicted results
plt.figure(figsize=(12, 6))
plt.plot(test_y.index, test_y, label="Actual Revenue", color='blue', linewidth=2)
plt.plot(test_y.index, CA_2_predictions, label="Predicted Revenue", color='red', linestyle="dashed", linewidth=2)
plt.xlabel("")
plt.ylabel("Revenue")
plt.title("CA_2 Store Revenue Forecasting (2015/3-2015/12) with 5 Major Events")
plt.legend()
plt.grid(True)
plt.show()
- Check Evaluation Metrics
# Compute evaluation metrics
mse = mean_squared_error(test_y, CA_2_predictions)
rmse = np.sqrt(mean_squared_error(test_y, CA_2_predictions))
mae = mean_absolute_error(test_y, CA_2_predictions)
mape = np.mean(np.abs((test_y - CA_2_predictions) / test_y)) * 100
# Display results
evaluation_results = (
f"Mean Squared Error (MSE): {mse:.2f}\n"
f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
f"Mean Absolute Error (MAE): {mae:.2f}\n"
f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)
print(evaluation_results)
Mean Squared Error (MSE): 4436671.37
Root Mean Squared Error (RMSE): 2106.34
Mean Absolute Error (MAE): 1791.30
Mean Absolute Percentage Error (MAPE): 145.45%
from sklearn.metrics import mean_absolute_percentage_error  # not in the import cell above, so imported here
mape_value = mean_absolute_percentage_error(
    test_y,
    CA_2_predictions
) * 100  # sklearn version returns a decimal, so multiply by 100
print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 145.45%
# Plot actual train data
plt.figure(figsize=(12, 6))
plt.plot(train_y.index, train_y, label="Actual Revenue", color='blue', linewidth=2)
plt.xlabel("")
plt.ylabel("Revenue")
plt.title("train_actual")
plt.legend()
plt.grid(True)
plt.show()
To investigate why the forecast is so far off, we plot the training data above. A likely cause is bias in the training data from 2015-01 to 2015-03: the trend decays over that window, pulling our forecast downward so it fails to catch the actual upward trend from 2015-03 to 2015-12¶
Comparison of CA_2 Forecast Plots with Evaluation Metrics¶
| CA_2 Top_Down Forecast | CA_2 Direct Forecast |
|---|---|
| (forecast plot) MSE: 1572627, RMSE: 1254, MAE: 1021, MAPE: 281% | (forecast plot) MSE: 4436671, RMSE: 2106, MAE: 1791, MAPE: 145% |
Conclusion: Neither statistical approach is recommended for CA_2 store revenue forecasting; ML and DL approaches should be explored instead
# Filter for store_id == 'CA_3'
CA_3_daily = Foods3_CA_df[Foods3_CA_df['store_id'] == 'CA_3'].copy()
CA_3_daily
| date | store_id | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | week | month | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1349 | 2012-01-01 | CA_3 | 2051 | 1.630491 | 4246.9053 | 1.0 | 0 | 0 | 1 | 0 | 0 | 2011-12-26/2012-01-01 | 2012-01 | 2012 |
| 1354 | 2012-01-02 | CA_3 | 3185 | 1.630491 | 6694.9766 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| 1357 | 2012-01-03 | CA_3 | 2842 | 1.630491 | 5886.2085 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| 1361 | 2012-01-04 | CA_3 | 2569 | 1.630491 | 5463.7550 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| 1366 | 2012-01-05 | CA_3 | 2645 | 1.630491 | 5683.1963 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7174 | 2015-12-27 | CA_3 | 2195 | 2.875478 | 5824.9710 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-21/2015-12-27 | 2015-12 | 2015 |
| 7178 | 2015-12-28 | CA_3 | 2205 | 2.875478 | 5684.8975 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
| 7182 | 2015-12-29 | CA_3 | 2053 | 2.875478 | 5417.0146 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
| 7185 | 2015-12-30 | CA_3 | 2269 | 2.875478 | 5948.0480 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
| 7191 | 2015-12-31 | CA_3 | 2682 | 2.875478 | 6914.9814 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
1461 rows × 14 columns
# Plot daily revenue trend
plt.figure(figsize=(12,6))
plt.plot(CA_3_daily['date'], CA_3_daily['revenue'], label='Daily Revenue', linewidth=1)
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('CA_3 Store Daily Revenue Trend Over Time')
plt.legend()
plt.grid(True)
plt.show()
# Plot histograms for each feature (skip first column - date)
CA_3_daily.iloc[:, 1:].hist(bins=30, figsize=(12, 10))
plt.tight_layout()
plt.show()
CA_3_daily.describe()
| date | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1461 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 |
| mean | 2013-12-31 00:00:00 | 3065.451745 | 2.387430 | 6731.888187 | 0.328542 | 0.002738 | 0.002738 | 0.002738 | 0.002738 | 0.002738 |
| min | 2012-01-01 00:00:00 | 0.000000 | 1.630491 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2012-12-31 00:00:00 | 2629.000000 | 1.988541 | 5855.423000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 2013-12-31 00:00:00 | 2992.000000 | 2.396947 | 6596.341000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 2014-12-31 00:00:00 | 3453.000000 | 2.762711 | 7539.036600 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 2015-12-31 00:00:00 | 5118.000000 | 2.900847 | 11199.707000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| std | NaN | 618.825955 | 0.400079 | 1320.156196 | 0.469844 | 0.052271 | 0.052271 | 0.052271 | 0.052271 | 0.052271 |
CA_3_daily.shape
(1461, 14)
# Apply decomposition (Additive model for clearer seasonality)
decomposition = seasonal_decompose(CA_3_daily.set_index('date')['revenue'], model='additive', period=365)
# Plot decomposition results
# Plot decomposition results (size the decomposition's own figure instead of
# creating an empty one, which previously left a stray "<Figure ...>" output)
fig = decomposition.plot()
fig.set_size_inches(12, 8)
fig.axes[0].set_title("")
fig.suptitle("Seasonal Decomposition of Revenue - CA_3 Store", fontsize=14)
plt.show()
# Seasonality strength: variance of the seasonal component relative to total variance
seasonality_strength = np.var(decomposition.seasonal) / np.var(CA_3_daily['revenue'])
# Display the seasonality strength
seasonality_strength
0.5534479211812071
Roughly 55% of the variance in the CA_3 store revenue series is explained by seasonality, indicating strong seasonality¶
# Autocorrelation and Partial Autocorrelation plots
plt.figure(figsize=(12,4))
plot_acf(CA_3_daily['revenue'], lags=50, zero=False, ax=plt.gca()) # Autocorrelation from lag 1
plt.title('Autocorrelation of Revenue (starting from lag 1)')
plt.show()
plt.figure(figsize=(12,4))
plot_pacf(CA_3_daily['revenue'], lags=50, zero=False, ax=plt.gca()) # Partial Autocorrelation from lag 1
plt.title('Partial Autocorrelation of Revenue (starting from lag 1)')
plt.show()
The PACF indicates the AR order p: the sharp cutoff after lag 1 makes p = 1 fairly clear
The ACF indicates the MA order q, but no clean cutoff is visible, so we rely on auto ARIMA to determine q
Begin Modeling¶
- Apply ADF test to check stationarity
# Automate Augmented Dickey-Fuller (ADF) test for stationarity
adf_test(CA_3_daily['revenue'])
Augmented Dickey-Fuller Test:
ADF test statistic       -3.028698
p-value                   0.032300
# lags used              24.000000
# observations         1436.000000
critical value (1%)      -3.434912
critical value (5%)      -2.863555
critical value (10%)     -2.567843
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
Since the series is already stationary, differencing is unnecessary; the differencing steps below are skipped (left commented out for reference)
# # Apply first-order differencing
# CA_3_daily['revenue_diff'] = CA_3_daily['revenue'].diff()
# # Perform ADF test again on the differenced series
# adf_test(CA_3_daily['revenue_diff'])
Augmented Dickey-Fuller Test:
ADF test statistic      -19.862482
p-value                   0.000000
# lags used              26.000000
# observations         1913.000000
critical value (1%)      -3.433773
critical value (5%)      -2.863052
critical value (10%)     -2.567575
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
# # Plot the differenced revenue
# plt.figure(figsize=(12,6))
# plt.plot(CA_3_daily['date'], CA_3_daily['revenue_diff'], label='Differenced Revenue', color='purple')
# plt.xlabel('')
# plt.ylabel('Differenced Revenue')
# plt.title('First-Order Differencing of Revenue to Achieve Stationarity')
# plt.legend()
# plt.grid(True)
# plt.show()
- Use auto ARIMA to determine the optimal (p, d, q) parameters. Since the PACF suggests p = 1, we fix p = 1 in auto ARIMA to reduce computation time
# Define the endogenous (target) variable and exogenous variables
y = CA_3_daily.set_index('date')['revenue']
exo = CA_3_daily.set_index('date')[['event1_Christmas', 'event1_Thanksgiving', 'event1_NewYear', 'event1_SuperBowl', 'event1_LaborDay']]
# Use Auto ARIMA to determine optimal SARIMAX parameters
import pmdarima as pm  # alias used below; the import cell above only imports auto_arima directly
auto_arima_model = pm.auto_arima(
y,
exogenous=exo,
stationary=True, # Tells auto_arima that the series is already stationary
seasonal=True,
m=7, # Assuming weekly seasonality
start_p=1, # Minimum AR order
max_p=1, # Maximum AR order (forces p = 1)
trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True
)
# Extract the best parameters
optimal_order = auto_arima_model.order
optimal_seasonal_order = auto_arima_model.seasonal_order
optimal_order, optimal_seasonal_order
Performing stepwise search to minimize aic
 ARIMA(1,0,2)(1,0,1)[7] intercept   : AIC=25449.209, Time=2.39 sec
 ARIMA(0,0,0)(0,0,0)[7] intercept   : AIC=25145.185, Time=0.03 sec
 ARIMA(1,0,0)(1,0,0)[7] intercept   : AIC=23993.565, Time=0.66 sec
 ARIMA(0,0,1)(0,0,1)[7] intercept   : AIC=24155.931, Time=0.72 sec
 ARIMA(0,0,0)(0,0,0)[7]             : AIC=29959.527, Time=0.02 sec
 ARIMA(1,0,0)(0,0,0)[7] intercept   : AIC=24375.603, Time=0.07 sec
 ARIMA(1,0,0)(2,0,0)[7] intercept   : AIC=23902.358, Time=3.44 sec
 ARIMA(1,0,0)(2,0,1)[7] intercept   : AIC=inf, Time=3.96 sec
 ARIMA(1,0,0)(1,0,1)[7] intercept   : AIC=23913.835, Time=1.39 sec
 ARIMA(0,0,0)(2,0,0)[7] intercept   : AIC=25101.957, Time=2.65 sec
 ARIMA(1,0,1)(2,0,0)[7] intercept   : AIC=23920.590, Time=3.75 sec
 ARIMA(0,0,1)(2,0,0)[7] intercept   : AIC=24497.755, Time=2.64 sec
 ARIMA(1,0,0)(2,0,0)[7]             : AIC=inf, Time=1.04 sec

Best model:  ARIMA(1,0,0)(2,0,0)[7] intercept
Total fit time: 22.768 seconds
((1, 0, 0), (2, 0, 0, 7))
# Display the auto ARIMA model summary
auto_arima_model.summary()
| Dep. Variable: | y | No. Observations: | 1461 |
|---|---|---|---|
| Model: | SARIMAX(1, 0, 0)x(2, 0, 0, 7) | Log Likelihood | -11946.179 |
| Date: | Wed, 12 Mar 2025 | AIC | 23902.358 |
| Time: | 20:54:42 | BIC | 23928.793 |
| Sample: | 01-01-2012 | HQIC | 23912.219 |
| - 12-31-2015 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 2391.1829 | 112.571 | 21.242 | 0.000 | 2170.549 | 2611.817 |
| ar.L1 | 0.3721 | 0.019 | 19.382 | 0.000 | 0.334 | 0.410 |
| ar.S.L7 | 0.3209 | 0.027 | 11.940 | 0.000 | 0.268 | 0.374 |
| ar.S.L14 | 0.1263 | 0.024 | 5.338 | 0.000 | 0.080 | 0.173 |
| sigma2 | 7.29e+05 | 1.62e+04 | 44.903 | 0.000 | 6.97e+05 | 7.61e+05 |
| Ljung-Box (L1) (Q): | 80.94 | Jarque-Bera (JB): | 1863.03 |
|---|---|---|---|
| Prob(Q): | 0.00 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 0.99 | Skew: | -0.43 |
| Prob(H) (two-sided): | 0.91 | Kurtosis: | 8.46 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
- Apply auto ARIMA results to fit SARIMAX model
# Fit the SARIMAX model
sarimax_model = SARIMAX(y,
exog=exo,
order=optimal_order,
seasonal_order=optimal_seasonal_order,
enforce_stationarity=False,
enforce_invertibility=False)
sarimax_results = sarimax_model.fit()
# Display model summary
sarimax_results.summary()
| Dep. Variable: | revenue | No. Observations: | 1461 |
|---|---|---|---|
| Model: | SARIMAX(1, 0, 0)x(2, 0, 0, 7) | Log Likelihood | -12338.703 |
| Date: | Wed, 12 Mar 2025 | AIC | 24695.406 |
| Time: | 20:54:53 | BIC | 24742.895 |
| Sample: | 01-01-2012 | HQIC | 24713.130 |
| - 12-31-2015 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| event1_Christmas | 5.1350 | 243.896 | 0.021 | 0.983 | -472.893 | 483.163 |
| event1_Thanksgiving | 4482.3930 | 242.378 | 18.493 | 0.000 | 4007.342 | 4957.444 |
| event1_NewYear | 4298.8950 | 325.920 | 13.190 | 0.000 | 3660.103 | 4937.687 |
| event1_SuperBowl | 8513.2784 | 261.749 | 32.525 | 0.000 | 8000.260 | 9026.297 |
| event1_LaborDay | 8554.2418 | 246.461 | 34.708 | 0.000 | 8071.187 | 9037.297 |
| ar.L1 | 0.3581 | 0.023 | 15.638 | 0.000 | 0.313 | 0.403 |
| ar.S.L7 | 0.5383 | 0.020 | 27.345 | 0.000 | 0.500 | 0.577 |
| ar.S.L14 | 0.4366 | 0.018 | 24.206 | 0.000 | 0.401 | 0.472 |
| sigma2 | 2.114e+06 | 1.25e+05 | 16.958 | 0.000 | 1.87e+06 | 2.36e+06 |
| Ljung-Box (L1) (Q): | 6.89 | Jarque-Bera (JB): | 7805.20 |
|---|---|---|---|
| Prob(Q): | 0.01 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 0.90 | Skew: | -0.84 |
| Prob(H) (two-sided): | 0.22 | Kurtosis: | 14.26 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
# Plot diagnostic plots (residuals, etc.)
sarimax_results.plot_diagnostics(figsize=(15, 12))
plt.show()
- Plot actual and predicted trend
plt.figure(figsize=(12, 6))
plt.plot(y.index, y, label='Actual', marker='o')
plt.plot(y.index, sarimax_results.fittedvalues, label='Fitted', marker='x', color='red')
plt.title('Actual vs Fitted Values')
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.legend()
plt.show()
- Evaluate predicted result using MAPE
# Actual values (endogenous variable)
y_true = y # or test set if we're evaluating out-of-sample
# In-sample fitted values from SARIMAX
y_pred = sarimax_results.fittedvalues
# Calculate MAPE
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f'MAPE: {mape:.2f}%')
MAPE: inf%
mape_value = mean_absolute_percentage_error(
    y_true,
    y_pred
) * 100  # sklearn version returns a decimal, so multiply by 100
print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 1391159110222060544.00%
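The astronomically large sklearn MAPE is an artifact of zero-revenue days (the `describe()` output above shows a minimum revenue of 0): the plain NumPy formula divides by zero and returns `inf`, while sklearn substitutes a tiny epsilon denominator, producing the huge number above. One pragmatic fix, sketched here with toy values, is to mask out zero actuals before averaging:

```python
import numpy as np

def safe_mape(y_true, y_pred):
    """MAPE computed only over observations with non-zero actuals.

    Zero actuals make |error| / actual undefined: np.mean returns inf,
    and sklearn's epsilon denominator explodes the percentage instead.
    Masking those rows is one common workaround.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100

# Toy demo: the single zero actual would make plain MAPE infinite
actual = np.array([100.0, 0.0, 200.0])
pred = np.array([110.0, 5.0, 180.0])
print(safe_mape(actual, pred))  # averages over the two non-zero rows only
```

Note this silently drops zero-revenue days from the metric; if those days matter, a scale-free alternative such as WAPE or sMAPE is preferable.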
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
# Display results
evaluation_results = (
f"Mean Squared Error (MSE): {mse:.2f}\n"
f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
f"Mean Absolute Error (MAE): {mae:.2f}\n"
)
print(evaluation_results)
Mean Squared Error (MSE): 1504035.74
Root Mean Squared Error (RMSE): 1226.39
Mean Absolute Error (MAE): 768.68
Model Evaluation¶
- Train-Test Validation
# Split data into train and test sets (80-20 split)
train_size = int(len(y) * 0.8)
train_y, test_y = y.iloc[:train_size], y.iloc[train_size:]
train_exo, test_exo = exo.iloc[:train_size, :], exo.iloc[train_size:, :]
# Refit SARIMAX model on training data
sarimax_model_train = SARIMAX(train_y,
exog=train_exo,
order=optimal_order,
seasonal_order=optimal_seasonal_order,
enforce_stationarity=False,
enforce_invertibility=False)
sarimax_results_train = sarimax_model_train.fit()
# Forecast on test set
CA_3_predictions = sarimax_results_train.forecast(steps=len(test_y), exog=test_exo)
- Check Distribution of MAPE
# Extract the actual and forecast values from the DataFrame
actual_values = test_y.values
forecast_values = CA_3_predictions
# Ensure both are one-dimensional arrays
actual_values = np.squeeze(actual_values)
forecast_values = np.squeeze(forecast_values)
# Calculate the MAPE for each row
mape_row_by_row = np.abs((actual_values - forecast_values) / actual_values) * 100
# Create a DataFrame to display the row-by-row MAPE along with the actual and forecast values
df_results = pd.DataFrame({
    'Actual': actual_values,
    'Forecast': forecast_values,
    'MAPE': mape_row_by_row
})
# Calculate the overall weighted MAPE
weights = actual_values / actual_values.sum()
overall_mape = np.sum(mape_row_by_row * weights)
# Display the DataFrame
print(df_results)
# Print the overall MAPE
print(f"Overall MAPE: {overall_mape:.2f}%")
                 Actual     Forecast       MAPE
2015-03-14    8467.2440  8462.797537   0.052514
2015-03-15   10116.2220  9585.442269   5.246818
2015-03-16    7627.6895  7536.943879   1.189687
2015-03-17    6562.2295  6866.026313   4.629476
2015-03-18    5713.6250  6065.148979   6.152381
...                 ...          ...        ...
2015-12-27    5824.9710  4784.281592  17.866002
2015-12-28    5684.8975  3769.431435  33.693942
2015-12-29    5417.0146  3390.932751  37.402185
2015-12-30    5948.0480  2988.907192  49.749780
2015-12-31    6914.9814  3168.309109  54.181958

[293 rows x 3 columns]
Overall MAPE: 23.73%
- Check Overestimate/Underestimate Forecasting & Predicted/Actual Pattern Match
# Plot predicted results
plt.figure(figsize=(12, 6))
plt.plot(test_y.index, test_y, label="Actual Revenue", color='blue', linewidth=2)
plt.plot(test_y.index, CA_3_predictions, label="Predicted Revenue", color='red', linestyle="dashed", linewidth=2)
plt.xlabel("")
plt.ylabel("Revenue")
plt.title("CA_3 Store Revenue Forecasting (2015/3-2015/12) with 5 Major Events")
plt.legend()
plt.grid(True)
plt.show()
- Check Evaluation Metrics
# Compute evaluation metrics
mse = mean_squared_error(test_y, CA_3_predictions)
rmse = np.sqrt(mean_squared_error(test_y, CA_3_predictions))
mae = mean_absolute_error(test_y, CA_3_predictions)
mape = np.mean(np.abs((test_y - CA_3_predictions) / test_y)) * 100
# Display results
evaluation_results = (
f"Mean Squared Error (MSE): {mse:.2f}\n"
f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
f"Mean Absolute Error (MAE): {mae:.2f}\n"
f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)
print(evaluation_results)
Mean Squared Error (MSE): 3547322.56
Root Mean Squared Error (RMSE): 1883.43
Mean Absolute Error (MAE): 1603.13
Mean Absolute Percentage Error (MAPE): 164.26%
mape_value = mean_absolute_percentage_error(
    test_y,
    CA_3_predictions
) * 100  # sklearn version returns a decimal, so multiply by 100
print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 164.26%
Comparison of CA_3 Forecast Plots with Evaluation Metrics¶
| CA_3 Top_Down Forecast | CA_3 Direct Forecast |
|---|---|
| (forecast plot) MSE: 1469628, RMSE: 1212, MAE: 820, MAPE: 317% | (forecast plot) MSE: 3547323, RMSE: 1883, MAE: 1603, MAPE: 164% |
Conclusion: The top-down approach is recommended for CA_3 store revenue forecasting
# Filter for store_id == 'CA_4'
CA_4_daily = Foods3_CA_df[Foods3_CA_df['store_id'] == 'CA_4'].copy()
CA_4_daily
| date | store_id | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | week | month | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1350 | 2012-01-01 | CA_4 | 738 | 1.589748 | 1637.3416 | 1.0 | 0 | 0 | 1 | 0 | 0 | 2011-12-26/2012-01-01 | 2012-01 | 2012 |
| 1355 | 2012-01-02 | CA_4 | 967 | 1.589748 | 2332.8342 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| 1356 | 2012-01-03 | CA_4 | 876 | 1.589748 | 1856.1882 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| 1360 | 2012-01-04 | CA_4 | 728 | 1.589748 | 1592.6340 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| 1367 | 2012-01-05 | CA_4 | 936 | 1.589748 | 2176.2485 | 1.0 | 0 | 0 | 0 | 0 | 0 | 2012-01-02/2012-01-08 | 2012-01 | 2012 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7175 | 2015-12-27 | CA_4 | 721 | 2.924935 | 1891.2354 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-21/2015-12-27 | 2015-12 | 2015 |
| 7179 | 2015-12-28 | CA_4 | 753 | 2.924935 | 2110.1110 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
| 7183 | 2015-12-29 | CA_4 | 642 | 2.924935 | 1747.9710 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
| 7184 | 2015-12-30 | CA_4 | 713 | 2.924935 | 1981.7717 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
| 7188 | 2015-12-31 | CA_4 | 1072 | 2.924935 | 2814.8704 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2015-12-28/2016-01-03 | 2015-12 | 2015 |
1461 rows × 14 columns
# Plot daily revenue trend
plt.figure(figsize=(12,6))
plt.plot(CA_4_daily['date'], CA_4_daily['revenue'], label='Daily Revenue', linewidth=1)
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('CA_4 Store Daily Revenue Trend Over Time')
plt.legend()
plt.grid(True)
plt.show()
# Plot histograms for each feature
CA_4_daily.iloc[:, 1:].hist(bins=30, figsize=(12, 10))
plt.tight_layout()
plt.show()
CA_4_daily.describe()
| date | unit_sold | sell_price | revenue | snap_CA | event1_Christmas | event1_Thanksgiving | event1_NewYear | event1_SuperBowl | event1_LaborDay | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1461 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 |
| mean | 2013-12-31 00:00:00 | 1066.749487 | 2.379939 | 2534.858355 | 0.328542 | 0.002738 | 0.002738 | 0.002738 | 0.002738 | 0.002738 |
| min | 2012-01-01 00:00:00 | 0.000000 | 1.589748 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2012-12-31 00:00:00 | 942.000000 | 1.969490 | 2245.784200 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 2013-12-31 00:00:00 | 1053.000000 | 2.371497 | 2496.722000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 2014-12-31 00:00:00 | 1181.000000 | 2.760504 | 2800.404800 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 2015-12-31 00:00:00 | 1822.000000 | 2.925345 | 3987.775400 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| std | NaN | 187.604346 | 0.415896 | 441.667311 | 0.469844 | 0.052271 | 0.052271 | 0.052271 | 0.052271 | 0.052271 |
CA_4_daily.shape
(1461, 14)
# Apply decomposition (Additive model for clearer seasonality)
decomposition = seasonal_decompose(CA_4_daily.set_index('date')['revenue'], model='additive', period=365)
# Plot decomposition results
# Plot decomposition results (size the decomposition's own figure instead of
# creating an empty one, which previously left a stray "<Figure ...>" output)
fig = decomposition.plot()
fig.set_size_inches(12, 8)
fig.axes[0].set_title("")
fig.suptitle("Seasonal Decomposition of Revenue - CA_4 Store", fontsize=14)
plt.show()
# Seasonality strength: variance of the seasonal component relative to total variance
seasonality_strength = np.var(decomposition.seasonal) / np.var(CA_4_daily['revenue'])
# Display the seasonality strength
seasonality_strength
0.5383630531175937
Roughly 54% of the variance in the CA_4 store revenue series is explained by seasonality, indicating strong seasonality¶
# Autocorrelation and Partial Autocorrelation plots
plt.figure(figsize=(12,4))
plot_acf(CA_4_daily['revenue'], lags=50, zero=False, ax=plt.gca()) # Autocorrelation from lag 1
plt.title('Autocorrelation of Revenue (starting from lag 1)')
plt.show()
plt.figure(figsize=(12,4))
plot_pacf(CA_4_daily['revenue'], lags=50, zero=False, ax=plt.gca()) # Partial Autocorrelation from lag 1
plt.title('Partial Autocorrelation of Revenue (starting from lag 1)')
plt.show()
The PACF indicates the AR order p: the sharp cutoff after lag 1 makes p = 1 fairly clear
The ACF indicates the MA order q, but no clean cutoff is visible, so we rely on auto ARIMA to determine q
Begin Modeling¶
- Apply ADF test to check stationarity
# Automate Augmented Dickey-Fuller (ADF) test for stationarity
adf_test(CA_4_daily['revenue'])
Augmented Dickey-Fuller Test:
ADF test statistic       -3.215335
p-value                   0.019114
# lags used              21.000000
# observations         1439.000000
critical value (1%)      -3.434902
critical value (5%)      -2.863551
critical value (10%)     -2.567840
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
Since the series is already stationary, differencing is unnecessary; the differencing steps below are skipped (left commented out for reference)
# # Apply first-order differencing
# CA_4_daily['revenue_diff'] = CA_4_daily['revenue'].diff()
# # Perform ADF test again on the differenced series
# adf_test(CA_4_daily['revenue_diff'])
Augmented Dickey-Fuller Test:
ADF test statistic     -1.542777e+01
p-value                 2.976135e-28
# lags used             2.600000e+01
# observations          1.913000e+03
critical value (1%)    -3.433773e+00
critical value (5%)    -2.863052e+00
critical value (10%)   -2.567575e+00
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
# # Plot the differenced revenue
# plt.figure(figsize=(12,6))
# plt.plot(CA_4_daily['date'], CA_4_daily['revenue_diff'], label='Differenced Revenue', color='purple')
# plt.xlabel('')
# plt.ylabel('Differenced Revenue')
# plt.title('First-Order Differencing of Revenue to Achieve Stationarity')
# plt.legend()
# plt.grid(True)
# plt.show()
- Use auto ARIMA to determine the optimal (p, d, q) parameters. Since the PACF suggests p = 1, we fix p = 1 in auto ARIMA to reduce computation time
# Define the endogenous (target) variable and exogenous variables
y = CA_4_daily.set_index('date')['revenue']
exo = CA_4_daily.set_index('date')[['event1_Christmas', 'event1_Thanksgiving', 'event1_NewYear', 'event1_SuperBowl', 'event1_LaborDay']]
# Use Auto ARIMA to determine optimal SARIMAX parameters
import pmdarima as pm  # alias used below; the import cell above only imports auto_arima directly
auto_arima_model = pm.auto_arima(
y,
exogenous=exo,
stationary=True, # Tells auto_arima that the series is already stationary
seasonal=True,
m=7, # Assuming weekly seasonality
start_p=1, # Minimum AR order
max_p=1, # Maximum AR order (forces p = 1)
trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True
)
# Extract the best parameters
optimal_order = auto_arima_model.order
optimal_seasonal_order = auto_arima_model.seasonal_order
optimal_order, optimal_seasonal_order
Performing stepwise search to minimize aic
 ARIMA(1,0,2)(1,0,1)[7] intercept   : AIC=20999.389, Time=3.93 sec
 ARIMA(0,0,0)(0,0,0)[7] intercept   : AIC=21945.745, Time=0.04 sec
 ARIMA(1,0,0)(1,0,0)[7] intercept   : AIC=21094.341, Time=1.80 sec
 ARIMA(0,0,1)(0,0,1)[7] intercept   : AIC=21274.526, Time=0.67 sec
 ARIMA(0,0,0)(0,0,0)[7]             : AIC=27094.127, Time=0.02 sec
 ARIMA(1,0,2)(0,0,1)[7] intercept   : AIC=21257.274, Time=1.16 sec
 ARIMA(1,0,2)(1,0,0)[7] intercept   : AIC=21125.368, Time=2.68 sec
 ARIMA(1,0,2)(2,0,1)[7] intercept   : AIC=inf, Time=4.00 sec
 ARIMA(1,0,2)(1,0,2)[7] intercept   : AIC=inf, Time=4.85 sec
 ARIMA(1,0,2)(0,0,0)[7] intercept   : AIC=21451.657, Time=0.34 sec
 ARIMA(1,0,2)(0,0,2)[7] intercept   : AIC=21146.087, Time=2.17 sec
 ARIMA(1,0,2)(2,0,0)[7] intercept   : AIC=20941.810, Time=3.92 sec
 ARIMA(0,0,2)(2,0,0)[7] intercept   : AIC=21402.372, Time=3.36 sec
 ARIMA(1,0,1)(2,0,0)[7] intercept   : AIC=21065.812, Time=3.81 sec
 ARIMA(1,0,3)(2,0,0)[7] intercept   : AIC=20946.273, Time=3.89 sec
 ARIMA(0,0,1)(2,0,0)[7] intercept   : AIC=21471.782, Time=2.71 sec
 ARIMA(0,0,3)(2,0,0)[7] intercept   : AIC=21411.703, Time=3.79 sec
 ARIMA(1,0,2)(2,0,0)[7]             : AIC=21083.673, Time=1.92 sec

Best model:  ARIMA(1,0,2)(2,0,0)[7] intercept
Total fit time: 45.072 seconds
((1, 0, 2), (2, 0, 0, 7))
# Display the auto ARIMA model summary
auto_arima_model.summary()
| Dep. Variable: | y | No. Observations: | 1461 |
|---|---|---|---|
| Model: | SARIMAX(1, 0, 2)x(2, 0, [], 7) | Log Likelihood | -10463.905 |
| Date: | Wed, 12 Mar 2025 | AIC | 20941.810 |
| Time: | 21:06:11 | BIC | 20978.818 |
| Sample: | 01-01-2012 | HQIC | 20955.615 |
| - 12-31-2015 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 1054.5207 | 178.457 | 5.909 | 0.000 | 704.752 | 1404.290 |
| ar.L1 | -0.4687 | 0.240 | -1.951 | 0.051 | -0.940 | 0.002 |
| ma.L1 | 0.7601 | 0.237 | 3.204 | 0.001 | 0.295 | 1.225 |
| ma.L2 | 0.2058 | 0.059 | 3.490 | 0.000 | 0.090 | 0.321 |
| ar.S.L7 | 0.3668 | 0.025 | 14.580 | 0.000 | 0.317 | 0.416 |
| ar.S.L14 | 0.3505 | 0.019 | 18.499 | 0.000 | 0.313 | 0.388 |
| sigma2 | 9.615e+04 | 1777.793 | 54.083 | 0.000 | 9.27e+04 | 9.96e+04 |
| Ljung-Box (L1) (Q): | 0.27 | Jarque-Bera (JB): | 3203.75 |
|---|---|---|---|
| Prob(Q): | 0.60 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 1.26 | Skew: | -0.51 |
| Prob(H) (two-sided): | 0.01 | Kurtosis: | 10.18 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
- Apply auto ARIMA results to fit SARIMAX model
# Fit the SARIMAX model
sarimax_model = SARIMAX(y,
exog=exo,
order=optimal_order,
seasonal_order=optimal_seasonal_order,
enforce_stationarity=False,
enforce_invertibility=False)
sarimax_results = sarimax_model.fit()
# Display model summary
sarimax_results.summary()
| Dep. Variable: | revenue | No. Observations: | 1461 |
|---|---|---|---|
| Model: | SARIMAX(1, 0, 2)x(2, 0, [], 7) | Log Likelihood | -10808.235 |
| Date: | Wed, 12 Mar 2025 | AIC | 21638.469 |
| Time: | 21:06:19 | BIC | 21696.512 |
| Sample: | 01-01-2012 | HQIC | 21660.132 |
| - 12-31-2015 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| event1_Christmas | -0.0001 | 59.112 | -2.35e-06 | 1.000 | -115.858 | 115.857 |
| event1_Thanksgiving | 1845.9065 | 58.709 | 31.442 | 0.000 | 1730.839 | 1960.974 |
| event1_NewYear | 1672.3256 | 98.250 | 17.021 | 0.000 | 1479.758 | 1864.893 |
| event1_SuperBowl | 3269.5215 | 58.107 | 56.267 | 0.000 | 3155.634 | 3383.409 |
| event1_LaborDay | 3200.6348 | 51.379 | 62.294 | 0.000 | 3099.933 | 3301.337 |
| ar.L1 | 0.9999 | 0.000 | 2246.155 | 0.000 | 0.999 | 1.001 |
| ma.L1 | -0.8346 | 0.033 | -25.069 | 0.000 | -0.900 | -0.769 |
| ma.L2 | -0.1393 | 0.035 | -4.001 | 0.000 | -0.207 | -0.071 |
| ar.S.L7 | 0.2917 | 0.024 | 11.993 | 0.000 | 0.244 | 0.339 |
| ar.S.L14 | 0.2332 | 0.028 | 8.420 | 0.000 | 0.179 | 0.287 |
| sigma2 | 2.412e+05 | 0.934 | 2.58e+05 | 0.000 | 2.41e+05 | 2.41e+05 |
| Ljung-Box (L1) (Q): | 0.03 | Jarque-Bera (JB): | 13698.30 |
|---|---|---|---|
| Prob(Q): | 0.86 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 1.02 | Skew: | -2.09 |
| Prob(H) (two-sided): | 0.85 | Kurtosis: | 17.49 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 5.12e+18. Standard errors may be unstable.
# Plot diagnostic plots (residuals, etc.)
sarimax_results.plot_diagnostics(figsize=(15, 12))
plt.show()
- Plot actual and predicted trend
plt.figure(figsize=(12, 6))
plt.plot(y.index, y, label='Actual', marker='o')
plt.plot(y.index, sarimax_results.fittedvalues, label='Fitted', marker='x', color='red')
plt.title('Actual vs Fitted Values')
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.legend()
plt.show()
- Evaluate predicted result using MAPE
# Actual values (endogenous variable)
y_true = y # or test set if we're evaluating out-of-sample
# In-sample fitted values from SARIMAX
y_pred = sarimax_results.fittedvalues
# Calculate MAPE
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f'MAPE: {mape:.2f}%')
MAPE: inf%
mape_value = mean_absolute_percentage_error(
    y_true,
    y_pred
) * 100  # sklearn version returns a decimal, so multiply by 100
print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 2760581763077460992.00%
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
# Display results
evaluation_results = (
f"Mean Squared Error (MSE): {mse:.2f}\n"
f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
f"Mean Absolute Error (MAE): {mae:.2f}\n"
)
print(evaluation_results)
Mean Squared Error (MSE): 177624.62
Root Mean Squared Error (RMSE): 421.46
Mean Absolute Error (MAE): 266.63
Model Evaluation¶
- Train-Test Validation
# Split data into train and test sets (80-20 split)
train_size = int(len(y) * 0.8)
train_y, test_y = y.iloc[:train_size], y.iloc[train_size:]
train_exo, test_exo = exo.iloc[:train_size, :], exo.iloc[train_size:, :]
# Refit SARIMAX model on training data
sarimax_model_train = SARIMAX(train_y,
exog=train_exo,
order=optimal_order,
seasonal_order=optimal_seasonal_order,
enforce_stationarity=False,
enforce_invertibility=False)
sarimax_results_train = sarimax_model_train.fit()
# Forecast on test set
CA_4_predictions = sarimax_results_train.forecast(steps=len(test_y), exog=test_exo)
- Check Distribution of MAPE
# Extract the actual and forecast values from the DataFrame
actual_values = test_y.values
forecast_values = CA_4_predictions
# Ensure both are one-dimensional arrays
actual_values = np.squeeze(actual_values)
forecast_values = np.squeeze(forecast_values)
# Calculate the MAPE for each row
mape_row_by_row = np.abs((actual_values - forecast_values) / actual_values) * 100
# Create a DataFrame to display the row-by-row MAPE along with the actual and forecast values
df_results = pd.DataFrame({
    'Actual': actual_values,
    'Forecast': forecast_values,
    'MAPE': mape_row_by_row
})
# Calculate the overall weighted MAPE
weights = actual_values / actual_values.sum()
overall_mape = np.sum(mape_row_by_row * weights)
# Display the DataFrame
print(df_results)
# Print the overall MAPE
print(f"Overall MAPE: {overall_mape:.2f}%")
                Actual     Forecast       MAPE
2015-03-14   2954.2010  2716.140032   8.058388
2015-03-15   3434.8767  3018.978510  12.108097
2015-03-16   2702.6387  2670.007773   1.207373
2015-03-17   2529.3633  2539.704017   0.408827
2015-03-18   2360.2935  2541.756934   7.688172
...                ...          ...        ...
2015-12-27   1891.2354  2714.121739  43.510519
2015-12-28   2110.1110  2714.346296  28.635237
2015-12-29   1747.9710  2714.570874  55.298393
2015-12-30   1981.7717  2714.795473  36.988306
2015-12-31   2814.8704  2715.020093   3.547244

[293 rows x 3 columns]
Overall MAPE: 12.87%
- Check Overestimate/Underestimate Forecasting & Predicted/Actual Pattern Match
# Plot predicted results
plt.figure(figsize=(12, 6))
plt.plot(test_y.index, test_y, label="Actual Revenue", color='blue', linewidth=2)
plt.plot(test_y.index, CA_4_predictions, label="Predicted Revenue", color='red', linestyle="dashed", linewidth=2)
plt.xlabel("")
plt.ylabel("Revenue")
plt.title("CA_4 Store Revenue Forecasting (2015/3-2015/12) with 5 Major Events")
plt.legend()
plt.grid(True)
plt.show()
- Check Evaluation Metrics
# Compute evaluation metrics
from sklearn.metrics import mean_absolute_error  # not imported in the setup cell above
mse = mean_squared_error(test_y, CA_4_predictions)
rmse = np.sqrt(mse)
mae = mean_absolute_error(test_y, CA_4_predictions)
mape = np.mean(np.abs((test_y - CA_4_predictions) / test_y)) * 100  # inf whenever test_y contains a zero-revenue day
# Display results
evaluation_results = (
f"Mean Squared Error (MSE): {mse:.2f}\n"
f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
f"Mean Absolute Error (MAE): {mae:.2f}\n"
f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)
print(evaluation_results)
Mean Squared Error (MSE): 224556.91
Root Mean Squared Error (RMSE): 473.87
Mean Absolute Error (MAE): 339.80
Mean Absolute Percentage Error (MAPE): inf%
from sklearn.metrics import mean_absolute_percentage_error  # not imported in the setup cell above
mape_value = mean_absolute_percentage_error(
    test_y,
    CA_4_predictions
) * 100 # sklearn version returns a decimal, so multiply by 100
print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 4171090284982807040.00%
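Both the infinite manual MAPE and the absurd sklearn figure trace back to the same cause: zero-revenue days in the test window put a zero (or near-zero epsilon substitute) in the denominator. A minimal sketch with made-up numbers showing the failure and two common workarounds:

```python
import numpy as np

actual = np.array([2500.0, 0.0, 1800.0, 2100.0])    # includes a zero-revenue day
forecast = np.array([2400.0, 150.0, 1900.0, 2000.0])

# Plain MAPE divides by the zero actual and blows up to inf
with np.errstate(divide="ignore"):
    plain_mape = np.mean(np.abs((actual - forecast) / actual)) * 100

# Workaround 1: drop zero-actual days before averaging
mask = actual != 0
mape_nonzero = np.mean(np.abs((actual[mask] - forecast[mask]) / actual[mask])) * 100

# Workaround 2: WMAPE never divides by a single day's revenue
wmape = np.abs(actual - forecast).sum() / actual.sum() * 100
```

Either variant gives a finite, interpretable percentage on this data; WMAPE additionally matches the revenue-weighted figure computed earlier in this section.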
Comparison of CA_4 Forecast Plots with Evaluation Metrics¶
| CA_4 Top_Down Forecast | CA_4 Direct Forecast |
|---|---|
| (forecast plot) MSE: 193354 · RMSE: 440 · MAE: 291 · MAPE: inf | (forecast plot) MSE: 224557 · RMSE: 474 · MAE: 340 · MAPE: inf |
Conclusion: The top-down approach is recommended for CA_4 store revenue forecasting
4. Bottom-Up Approach¶
# # Align them in a single DataFrame
# df_forecasts = pd.DataFrame({
# 'CA_1': CA_1_predictions,
# 'CA_2': CA_2_predictions,
# 'CA_3': CA_3_predictions,
# 'CA_4': CA_4_predictions
# })
# # Sum across the columns to get the state forecast
# df_forecasts['State_Forecast'] = df_forecasts.sum(axis=1)
# # Now you have a time series for the state-level revenue
# state_forecast = df_forecasts['State_Forecast']
- Calculate Foods3 Predicted Revenue
# Get Foods3 department revenue forecast
forecast_Foods3 = CA_1_predictions + CA_2_predictions + CA_3_predictions + CA_4_predictions
forecast_Foods3
2015-03-14 20265.332966
2015-03-15 22190.123335
2015-03-16 17192.977805
2015-03-17 15880.629181
2015-03-18 14969.381244
...
2015-12-27 14760.474069
2015-12-28 12691.721352
2015-12-29 12209.749582
2015-12-30 11789.361566
2015-12-31 12041.423372
Freq: D, Name: predicted_mean, Length: 293, dtype: float64
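Summing the four store forecasts with `+` works here because all four Series share the same DatetimeIndex; pandas aligns on the index, and any date missing from one store would silently become NaN in the total. A small demonstration with made-up store forecasts (names are illustrative):

```python
import pandas as pd

idx = pd.date_range("2015-03-14", periods=3, freq="D")
store_a = pd.Series([5000.0, 5200.0, 4100.0], index=idx)  # stand-in store forecast
store_b = pd.Series([4800.0, 5600.0, 4300.0], index=idx)

# '+' aligns on the DatetimeIndex, so matching indexes sum cleanly
total = store_a + store_b

# With a shifted index the sum silently turns into NaN on non-overlapping dates
shifted = store_b.copy()
shifted.index = shifted.index + pd.Timedelta(days=1)
n_missing = (store_a + shifted).isna().sum()
```

Checking `isna().sum()` on the summed forecast is a cheap guard before trusting a bottom-up aggregate.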
- Get Foods3 Actual Revenue
# 1. Copy for Foods3 department (assign the result; an unassigned .copy() is a no-op)
department_daily = department_daily.copy()
# 2. Restrict to desired date range (2015-03-14 to 2015-12-31)
start_date = '2015-03-14'
end_date = '2015-12-31'
mask = (department_daily['date'] >= start_date) & (department_daily['date'] <= end_date)
Foods3_filtered = department_daily.loc[mask]
# 3. Select only 'date' and 'revenue', then sort by date
actual_Foods3 = Foods3_filtered[['date', 'revenue']].sort_values('date')
# 4. Set date column as index
actual_Foods3.set_index('date', inplace=True)
# 5. Print time series of actual revenue
actual_Foods3
| revenue | |
|---|---|
| date | |
| 2015-03-14 | 20577.9210 |
| 2015-03-15 | 24262.6775 |
| 2015-03-16 | 16995.1786 |
| 2015-03-17 | 14918.4433 |
| 2015-03-18 | 13953.3844 |
| ... | ... |
| 2015-12-27 | 16427.9190 |
| 2015-12-28 | 15311.9904 |
| 2015-12-29 | 14675.1821 |
| 2015-12-30 | 16201.4307 |
| 2015-12-31 | 20203.9008 |
293 rows × 1 columns
- Plot Foods3 Actual VS Forecast
# Plot actual VS forecast
plt.figure(figsize=(10, 6))
plt.plot(actual_Foods3.index, actual_Foods3, label='Actual Foods3 Revenue')
plt.plot(forecast_Foods3.index, forecast_Foods3, label='Forecast Foods3 Revenue')
# Add labels and legend
plt.xlabel('')
plt.ylabel('Revenue')
plt.title('State Level (CA) Foods3 Department Revenue (2015/3-2015/12) Forecasting with 5 Major Events')
plt.legend()
# Show the plot
plt.show()
- Evaluate Foods3 Bottom-Up Forecast Result
# Compute evaluation metrics
mse = mean_squared_error(actual_Foods3, forecast_Foods3)
rmse = np.sqrt(mse)
mae = mean_absolute_error(actual_Foods3, forecast_Foods3)
# NOTE: actual_Foods3 is a one-column DataFrame, so this subtraction aligns the
# forecast Series against the COLUMN labels rather than the dates, producing NaN
# (hence the nan% below); use actual_Foods3['revenue'] to align on dates instead
mape = np.mean(np.abs((actual_Foods3 - forecast_Foods3) / actual_Foods3)) * 100
# Display results
evaluation_results = (
f"Mean Squared Error (MSE): {mse:.2f}\n"
f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
f"Mean Absolute Error (MAE): {mae:.2f}\n"
f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)
print(evaluation_results)
Mean Squared Error (MSE): 17542113.09
Root Mean Squared Error (RMSE): 4188.33
Mean Absolute Error (MAE): 3363.54
Mean Absolute Percentage Error (MAPE): nan%
from sklearn.metrics import mean_absolute_percentage_error  # not imported in the setup cell above
mape_value = mean_absolute_percentage_error(
    actual_Foods3,
    forecast_Foods3
) * 100 # sklearn version returns a decimal, so multiply by 100
print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 317.61%
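The nan% above is an alignment artifact, not a property of the forecast: subtracting a Series from a DataFrame aligns the Series against the column labels, and `revenue` shares nothing with the date labels. A minimal reproduction with made-up numbers, plus the column-selection fix:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2015-03-14", periods=3, freq="D")
actual_df = pd.DataFrame({"revenue": [20577.9, 24262.7, 16995.2]}, index=idx)
forecast_s = pd.Series([20265.3, 22190.1, 17193.0], index=idx, name="predicted_mean")

# DataFrame - Series aligns the Series against the COLUMNS, not the index,
# so 'revenue' matches none of the date labels and everything becomes NaN
all_nan = (actual_df - forecast_s).isna().all().all()

# Selecting the column first gives Series - Series, aligned on dates
mape_fixed = np.mean(np.abs((actual_df["revenue"] - forecast_s)
                            / actual_df["revenue"])) * 100
```

Recomputing the notebook's manual MAPE with `actual_Foods3['revenue']` in place of the DataFrame would likely give a sensible percentage rather than nan.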
Comparison of Foods3 Forecast Plots with Evaluation Metrics¶
| Foods3 Bottom_Up Forecast | Foods3 Direct Forecast |
|---|---|
| (forecast plot) MSE: 17542113 · RMSE: 4188 · MAE: 3364 · MAPE: 318% | (forecast plot) MSE: 6068120 · RMSE: 2463 · MAE: 1553 · MAPE: 526% |
Conclusion: The bottom-up approach is not recommended for Foods3 department revenue forecasting
Summary¶
| Store/Department | Forecast Recommendation |
|---|---|
| Foods3 Department | Directly Forecast at Department Level is Recommended |
| CA_1 Store | Forecast at Departmental Level & Top-Down Approach is Recommended |
| CA_2 Store | Forecast with ML & DL Model is Recommended |
| CA_3 Store | Forecast at Departmental Level & Top-Down Approach is Recommended |
| CA_4 Store | Forecast at Departmental Level & Top-Down Approach is Recommended |
Forecasting department revenue directly, then using the top-down approach to derive each store's revenue, is highly recommended in this business case
Why?¶
Why Top-down Approach is Recommended:¶
Captures the 55% seasonal strength while smoothing out noise in individual store data, such as zero-revenue days and CA_2-style daily fluctuations
It uses one model:
Saves computational time and cost
Easier to maintain and debug
Lower likelihood of overfitting
Why Bottom-up Approach is Not Recommended:¶
Store-specific errors can compound at the department level, leading to misaligned inventory or staffing decisions
It uses many models:
Consumes more resources and runs slower
Difficult to maintain and deploy in production
Harder to explain to clients
Why Hybrid Approach with GBRT, LSTM, and SARIMAX is Better:¶
- Given our data’s characteristics:
55% strong seasonality
4 zero-revenue days
5 event exogenous variables
Overall non-linear trends
In this situation, an ML model (GBRT) or a DL model (LSTM) could reduce errors and improve accuracy. However, SARIMAX is preferred for its interpretability and its speed in producing store forecasts via the top-down approach.
Therefore, a hybrid approach that combines the strengths of all three methods could be the most effective solution